Unicode Primer

By Najaf Ali

When you first start out programming you don't really care about character encoding. A string is a string, and as long as it works and behaves as expected you have no reason to worry.

Sooner or later you will run into character encoding issues. The common wisdom is to switch every character encoding option you can find to UTF-8 and then hope for the best. This works up to a point, but you're not really prepared for anything more complex than debugging database error messages or figuring out why your website isn't rendering text properly.

Hopefully, by the end of this article, you'll have a good understanding of Unicode and character encodings and will be better able to debug non-trivial issues. I tried to write the article I wish I could have read when I was starting out.

What exactly is Unicode?

Unicode is a mapping from all of the characters from all of the known writing systems of mankind to hexadecimal values called code points.

By 'characters' we mean the abstract concept of a character, for example, a lowercase 'a' or an uppercase 'B'.

By hexadecimal values we mean base-16 numbers like 0x00 or 0xF3. If you're not intuitively comfortable with hex, take five minutes and get it handled. You'll thank yourself later.

Unicode maps characters to the hex values between 0x0000 and 0x10FFFF. In decimal this amounts to 1,114,112 code points.

Here are some example characters and their Unicode code points:

char  codepoint

C     0x0043
5     0x0035
%     0x0025
本    0x672C
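
If you want to check the table yourself, here is a minimal Python sketch. It leans on the built-in ord() and chr(); the helper name code_point_hex and the constant CODE_SPACE_SIZE are made up for this example.

```python
def code_point_hex(char):
    """Return the code point of a single character as a hex string."""
    return f"0x{ord(char):04X}"

# Print the code point of each example character from the table.
for char in ["C", "5", "%", "本"]:
    print(code_point_hex(char))

# chr() goes the other way, from a code point back to a character.
same_character = chr(0x672C) == "本"
print(same_character)

# Every value from 0x0000 to 0x10FFFF inclusive is a code point.
CODE_SPACE_SIZE = 0x10FFFF + 1
print(f"{CODE_SPACE_SIZE:,}")
```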

Unicode covers a lot of characters. To make it a little easier to manage, it's broken down into blocks that group related characters together. All of the blocks are available as code charts on the Unicode site, for example the Basic Latin block (the bit that looks like ASCII).

Unicode says nothing about how a code point is represented in a computer's memory. It's a purely abstract construct.

How are Unicode code points stored in memory?

For Unicode to be of any use to us as programmers, we need some way of representing it in a computer. To do this, we have character encodings that allow us to map between bytes in memory and Unicode code points. Popular Unicode encodings include UTF-8, UTF-16 and UTF-32.

Let's have a quick think about persisting hexadecimal values.

A single digit in hex like 0x3 takes exactly four bits (or half a byte) to represent in binary (0011). Take a minute to convince yourself of this, otherwise the rest of this article won't make much sense.

A two-digit hex value like 0xF2 could fit into a single byte on disk as 11110010.

For four hex digits, you'd need two bytes, e.g. 0x672C could be encoded as 01100111 00101100.

Following from that, for six hex digits you would need three bytes. Since the highest Unicode code point is 0x10FFFF (six hex digits), three bytes is all we need to encode every Unicode code point into binary.

The encoding scheme we've defined here is a very rough approximation of UTF-32 (albeit with one fewer byte).

Here are the Unicode characters we showed above with their UTF-32 binary encodings:
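
The hex-to-bits arithmetic above can be checked with a few lines of Python. This is only a sketch of the naive "store the hex value as-is" idea, not a real encoding, and to_bits is a name invented for the example.

```python
def to_bits(value, n_bytes):
    """Render an integer as n_bytes groups of eight bits, most significant byte first."""
    raw = value.to_bytes(n_bytes, "big")
    return " ".join(f"{byte:08b}" for byte in raw)

print(f"{0x3:04b}")           # one hex digit needs four bits
print(to_bits(0xF2, 1))       # two hex digits fit in one byte
print(to_bits(0x672C, 2))     # four hex digits fit in two bytes
print(to_bits(0x10FFFF, 3))   # the highest code point fits in three bytes
```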

char  codepoint  UTF-32 Binary

C     0x0043     00000000 00000000 00000000 01000011
5     0x0035     00000000 00000000 00000000 00110101
%     0x0025     00000000 00000000 00000000 00100101
本    0x672C     00000000 00000000 01100111 00101100
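
The table can be reproduced with Python's built-in codecs. A small sketch follows; utf32_bits is a hypothetical helper. It uses the "utf-32-be" codec (big-endian, most significant byte first) because that matches the byte order shown in the table. Python's plain "utf-32" codec also prepends a byte order mark, which would add four more bytes to the output.

```python
def utf32_bits(char):
    """Encode one character as big-endian UTF-32 and show the bits of each byte."""
    encoded = char.encode("utf-32-be")  # "-be" = big-endian, no byte order mark
    return " ".join(f"{byte:08b}" for byte in encoded)

for char in ["C", "5", "%", "本"]:
    print(f"0x{ord(char):04X}", utf32_bits(char))
```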

UTF-32 is one of a few ways of encoding Unicode code points into binary (so that they can be stored in the memory of a computer). UTF-32 is a fixed-length encoding: every character takes up 4 bytes (32 bits) in memory.

As you can see from the above example, reserving four bytes for every single character wastes a lot of space, especially if the majority of your characters are in the Basic Latin block and only ever need one of those bytes.

UTF-8 is a variable-length Unicode encoding. It encodes Unicode code points into between one and four bytes in memory. For this to work it's somewhat more complex than storing the hexadecimal value of code points in binary.

UTF-8 saves space by only using the number of bytes required for storing each character. Here is the UTF-8 binary for our example characters again:

char  codepoint  UTF-8 Binary

C     0x0043     01000011
5     0x0035     00110101
%     0x0025     00100101
本    0x672C     11100110 10011100 10101100

The first three characters are Latin and fit into a single byte. Our Chinese character, on the other hand, is no longer a simple representation of its code point's hex value.

Note: Why the hell does UTF-32 take four bytes if the highest Unicode code point can only ever require three? Unicode used to be bigger: the original ISO 10646 design (UCS-4) allowed for a 31-bit code space, which was later capped at 0x10FFFF so that every code point could also be represented in UTF-16. Also, the first 11 'unused' bits in a UTF-32 char are sometimes used for non-Unicode data.
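
To see where those three bytes come from, here is a rough hand-rolled UTF-8 encoder in Python. It is a sketch for illustration only: utf8_encode and utf8_bits are names made up here, it does not reject the surrogate range 0xD800 to 0xDFFF the way a real encoder must, and in practice you would just call str.encode("utf-8"). Each multi-byte sequence starts with a lead byte whose high bits say how many bytes follow, and every following byte starts with the bits 10.

```python
def utf8_encode(code_point):
    """Pack a code point into 1-4 bytes using the UTF-8 bit layout (no surrogate check)."""
    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def utf8_bits(code_point):
    """Show the hand-encoded bytes as groups of eight bits."""
    return " ".join(f"{byte:08b}" for byte in utf8_encode(code_point))

for code_point in [0x0043, 0x0035, 0x0025, 0x672C]:
    print(f"0x{code_point:04X}", utf8_bits(code_point))
```

For these code points the result agrees with Python's built-in encoder, which is the one to use in real code.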

Can you tell the character encoding of a given string?

Not really... let's say you have a string from some source. In your editor, your program's output and what have you, it looks to you like this:

Hello World!

In reality, you have a machine with some voltage differences in it that represent binary data. A less confusing way to visualize it would be like this:

01001000 01100101 01101100 01101100 01101111 00100000 
01010111 01101111 01110010 01101100 01100100 00100001 

How would you figure out the encoding? It's just 1s and 0s in a machine.

Some libraries are able to make a guess at your encoding, or at least suggest one amongst some candidates with a confidence score.

For the most part, we rely on metadata from wherever we got the string. Databases and web servers usually give some indication of the encoding of data they emit. For websites you can inspect the Content-Type header, for example.

This is by no means guaranteed to be the actual encoding. A database which has all its options switched to UTF-8 may happily store and emit Shift-JIS if a developer who doesn't convert data before inserting puts it there. (The only issues you'll run into are operations on the data that expect it to be in a given encoding. Input and output won't necessarily be a problem.)
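
To get a feel for why this can only ever be a guess, here is a small Python sketch that tries a handful of encodings on the bytes shown above and keeps the ones that decode without error. It is not how real detection libraries work (they use statistics, not just "did it decode"), and plausible_encodings and the candidate list are assumptions made for this example.

```python
# The bytes from the example above, written as groups of eight bits.
BITS = (
    "01001000 01100101 01101100 01101100 01101111 00100000 "
    "01010111 01101111 01110010 01101100 01100100 00100001"
)
data = bytes(int(group, 2) for group in BITS.split())

def plausible_encodings(data, candidates):
    """Return {encoding: decoded text} for every candidate that decodes without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results

results = plausible_encodings(
    data, ["ascii", "utf-8", "latin-1", "shift_jis", "utf-16-le"]
)
for name, text in results.items():
    print(name, len(text), "characters")
```

All five candidates decode these bytes without complaint. Four of them agree on the text, while UTF-16 (little-endian) reads the same twelve bytes as six entirely different characters.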

Data is data is data, it's all just 1s and 0s in a machine. No string knows anything about how it has been encoded.
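
Here is a sketch of what a lying label looks like in practice, using Python's built-in codecs. The scenario (Shift-JIS bytes behind a UTF-8 label) is the one described above; decode_as_labelled is a hypothetical helper, not part of any library.

```python
def decode_as_labelled(data, claimed_encoding):
    """Decode bytes using whatever encoding the metadata claims; None if that fails."""
    try:
        return data.decode(claimed_encoding)
    except UnicodeDecodeError:
        return None

# What a careless client actually stored: Shift-JIS bytes.
stored = "本".encode("shift_jis")
print(stored.hex())

# Storing and returning the bytes works fine. Trusting the label does not.
print(decode_as_labelled(stored, "utf-8") is None)       # the label says UTF-8
print(decode_as_labelled(stored, "shift_jis") == "本")    # what it really is
```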

Is there a way of converting a string from one encoding to another?

Yes... but it's not always going to be possible.

Using the iconv library and CLI tool, you can convert a string from one encoding to another.

For example, this is how you might convert a string from UTF-8 to UTF-16:

echo "Hello World" | iconv -f UTF-8 -t UTF-16

You will need to know the string's current encoding to do this.

If this isn't obvious, think about how you would implement the conversion. Your input is a random sequence of bytes about which you know nothing. Your output is another sequence of bytes in some target encoding that you do know. If you don't know the source encoding, how do you know what to convert it from?

For Unicode encodings like UTF-8, UTF-16 and UTF-32, converting between the encodings is lossless because they are Unicode encodings, i.e. each of them can be used to map to any Unicode code point.

For other encodings however, all bets are off. Try this:
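
Sketched in Python, a conversion is two steps: decode the bytes into code points, then encode those code points into the target. The function name convert is invented for this example, and it says nothing about how iconv is implemented internally. The point is that the first step has nothing to work with unless you pass in the source encoding.

```python
def convert(data, source_encoding, target_encoding):
    """Re-encode bytes: decode with the source encoding, then encode to the target."""
    text = data.decode(source_encoding)   # impossible without knowing the source
    return text.encode(target_encoding)

utf8_bytes = "Hello World".encode("utf-8")
utf16_bytes = convert(utf8_bytes, "utf-8", "utf-16-le")

# Same text, different bytes: UTF-16 uses two bytes for each of these characters.
print(len(utf8_bytes), len(utf16_bytes))
```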
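
A quick way to convince yourself of this is a round trip. The Python sketch below pushes a string through all three encodings and checks that it comes out unchanged. round_trip and the sample string are made up for the example; the sample deliberately includes a code point above 0xFFFF.

```python
def round_trip(text):
    """Pass text through UTF-8, UTF-16 and UTF-32 and return what comes out."""
    as_utf8 = text.encode("utf-8")
    as_utf16 = as_utf8.decode("utf-8").encode("utf-16")
    as_utf32 = as_utf16.decode("utf-16").encode("utf-32")
    return as_utf32.decode("utf-32")

sample = "C5%本\U0001F600"   # the article's examples plus one code point above 0xFFFF
print(round_trip(sample) == sample)
```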

echo "本" | iconv -f UTF-8 -t ASCII

This attempts to convert the string "本" from UTF-8 to ASCII. Since there's no way to represent the character in ASCII, iconv throws an error and dies.

The following however will work just fine:

echo "Hello World" | iconv -f UTF-8 -t ASCII
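
The same experiment in Python, as a rough equivalent rather than a description of what iconv does internally. to_ascii is a hypothetical helper that returns None instead of raising when a character has no ASCII form.

```python
def to_ascii(text):
    """Return the ASCII bytes for text, or None if some character has no ASCII form."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

print(to_ascii("Hello World"))   # every character exists in ASCII
print(to_ascii("本"))             # no ASCII equivalent

# errors="replace" avoids the exception, but the original character is gone for good.
lossy = "本".encode("ascii", errors="replace")
print(lossy)
```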

If it's just binary data, then why does it show up as letters and numbers on my screen?

Fonts. Unicode fonts know how to render sequences of Unicode code points to your screen.

Character encodings happen at the layer below that, i.e. decoding binary data into Unicode code points that fonts can then render.
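
That decoding layer can be sketched in a couple of lines of Python. code_points is a name invented for this example; it turns raw bytes into the list of code points that would then be handed over for rendering.

```python
def code_points(data, encoding):
    """Decode bytes with the given encoding and list the resulting code points."""
    return [ord(char) for char in data.decode(encoding)]

# The three UTF-8 bytes for 本 from the table earlier in the article.
raw = bytes([0b11100110, 0b10011100, 0b10101100])
print([f"0x{cp:04X}" for cp in code_points(raw, "utf-8")])
```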

Wrap Up

The most important thing to keep in mind when working with text in computers is that it's just binary data. Any assertion about how to decode it into Unicode is at best taken on trust and at worst a wild-arse guess.

Most non-trivial issues you run into are going to be because you or a part of your system thinks a string is in one encoding when it is actually in another.
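
Here is a minimal Python sketch of that kind of mix-up, assuming the text was written as UTF-8 and read back as Latin-1 (one common pairing, picked purely for illustration). misread is a hypothetical helper.

```python
def misread(text, actual_encoding, assumed_encoding):
    """Encode text one way, then decode it the way a confused component would."""
    return text.encode(actual_encoding).decode(assumed_encoding)

original = "caf\u00e9"                             # "café"
garbled = misread(original, "utf-8", "latin-1")    # reads back as "cafÃ©"
print(ascii(garbled))

# If you can work out which two encodings were confused, the damage can be undone.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

The repair only works here because Latin-1 maps every byte to some character, so nothing was lost along the way. With other pairings the wrong decode can fail outright or throw information away.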