Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:
Encoding
The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:
0x00000000 - 0x0000007F:
0xxxxxxx
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
[... removed obsolete five and six byte forms ...]
The xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent the
code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
as 0xfffe and 0xffff (UCS noncharacters) should not appear in
conforming UTF-8 streams.
It might be easier to remember a 'compressed' version of the chart:
The initial byte of a mangled codepoint starts with a 1, followed by padding of the form 1+0 (one or more 1 bits, then a 0). Subsequent bytes start with 10.
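0x80 and up: 5 bits in the initial byte, one subsequent byte
0x800 and up: 4 bits in the initial byte, two subsequent bytes
0x10000 and up: 3 bits in the initial byte, three subsequent bytes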
You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:
2**(5+1*6) == 2048 == 0x800
2**(4+2*6) == 65536 == 0x10000
2**(3+3*6) == 2097152 == 0x200000
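The same arithmetic can be written as a short Python sketch. The names `forms` and `upper_bound` are made up for this illustration; the only facts it relies on are the ones in the chart: each subsequent byte carries 6 bits, and the initial byte carries 7, 5, 4 or 3.

```python
# (bits available in the initial byte, number of subsequent 10xxxxxx bytes)
forms = [(7, 0), (5, 1), (4, 2), (3, 3)]

def upper_bound(lead_bits, subsequent_bytes):
    """First code number that no longer fits in this form."""
    return 2 ** (lead_bits + 6 * subsequent_bytes)

for lead_bits, subsequent in forms:
    total = 1 + subsequent
    print(f"{total} byte(s): code numbers below {upper_bound(lead_bits, subsequent):#x}")
```

Each bound is where the next, longer form takes over, which is how the ranges in the manpage chart line up.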
I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)
Update
Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:
U+4E3E
This fits in the 0x00000800 - 0x0000FFFF range (0x4E3E < 0xFFFF), so the representation will be of the form:
1110xxxx 10xxxxxx 10xxxxxx
0x4E3E is 100111000111110b. Drop the bits into the x positions above (start from the right; we'll fill in any missing bits at the start with 0):
1110x100 10111000 10111110
There is one x spot left over at the start; fill it in with 0:
11100100 10111000 10111110
Convert from bits to hex:
0xE4 0xB8 0xBE
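If you want to check your hand-worked answers, here is a minimal Python sketch of the same procedure. The function name `utf8_encode` is invented for this example, and it follows the manpage chart up to 0x1FFFFF (modern Unicode stops at U+10FFFF, so a real encoder would be stricter). It peels off 6 bits at a time from the right, exactly like filling in the x positions by hand.

```python
def utf8_encode(codepoint):
    """Encode one UCS code number as UTF-8 bytes, following the chart by hand."""
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("UTF-16 surrogates must not appear in UTF-8")
    if codepoint < 0x80:
        return bytes([codepoint])  # 0xxxxxxx: plain ASCII, one byte

    # (first code number that does NOT fit, initial-byte prefix, subsequent bytes)
    forms = [
        (0x800, 0b11000000, 1),     # 110xxxxx 10xxxxxx
        (0x10000, 0b11100000, 2),   # 1110xxxx 10xxxxxx 10xxxxxx
        (0x200000, 0b11110000, 3),  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    ]
    for limit, prefix, subsequent in forms:
        if codepoint < limit:
            break  # shortest form that can hold the value
    else:
        raise ValueError("code number too large for the four-byte form")

    out = []
    for _ in range(subsequent):
        # Take the lowest 6 bits and put them behind a 10 marker.
        out.append(0b10000000 | (codepoint & 0b111111))
        codepoint >>= 6
    # Whatever is left goes in the initial byte; unused x spots stay 0.
    out.append(prefix | codepoint)
    return bytes(reversed(out))
```

Calling `utf8_encode(0x4E3E)` gives the bytes 0xE4 0xB8 0xBE worked out above, and you can compare any other result against Python's own `chr(n).encode("utf-8")`.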