Character codes and encoding
In the beginning, there was ASCII. (There were others, but we begin here with ASCII).
7-bit ASCII in an 8-bit package. Using only the first seven bits of a byte, standard ASCII could not deal with diacritical marks (accents and funny dots in the vulgar vernacular). Therefore, the German word for later, später, would sometimes be transliterated into spaeter. I suspect this started in the days before computers when typewriters had only so many keys. You had to make do.
The computer age began before most of us were born. With it came the need to encode characters into bits. And since the world contains more than the simple “unaccented” Latin characters, other standards emerged. However, most surviving systems use the basic 7-bit ASCII characters as their starting point.
Listing 1. später transliterated into spaeter
Name: s p a e t e r
----- ----- ----- ----- ----- ----- -----
UTF-16: 00 73 00 70 00 61 00 65 00 74 00 65 00 72
UTF-8: 73 70 61 65 74 65 72
ISO-8859-1: 73 70 61 65 74 65 72
Windows-1252: 73 70 61 65 74 65 72
Mac OS Roman: 73 70 61 65 74 65 72
ISO-8859-1, Windows-1252 and Mac OS Roman are all 8-bit code tables. That is, every character defined in these standards is represented by 8 bits. These represent characters used by most of the western European languages.
In contrast, UTF-16 represents a character with two bytes (or four, for characters outside the 16-bit range) and UTF-8 represents a character with one or more bytes.
Basic ASCII uses only the lower 128 code-points in a byte, 0 through 127 (ergo, the reference to 7-bit ASCII). But there are 8 bits in a byte. That leaves the upper half of the code-points open to accommodate other characters.
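A minimal Python sketch of the point above. The codec names (`latin-1`, `cp1252`, `mac_roman`) are Python's labels for these tables, not names used in this article. Text that stays within the 7-bit range should encode to the same bytes in every table discussed so far.

```python
# Python's names for ASCII, UTF-8 and the three 8-bit tables discussed here.
CODECS = ["ascii", "utf-8", "latin-1", "cp1252", "mac_roman"]

word = "spaeter"  # the transliterated form from Listing 1

# Every code point is below 128, so each character fits in 7 bits.
fits_in_seven_bits = all(ord(ch) < 128 for ch in word)

# Encode with each codec and collect the distinct byte strings.
encodings = {codec: word.encode(codec) for codec in CODECS}
distinct = set(encodings.values())

for codec, raw in encodings.items():
    print(f"{codec:10} {raw.hex(' ').upper()}")
print("7-bit clean:", fits_in_seven_bits, "| distinct encodings:", len(distinct))
```

The moment a character from the upper half of the byte appears, the tables stop agreeing, which is what the next listings show.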
ISO-8859-1 and Windows-1252 have the same encoding for ä but Mac OS Roman does not. This historical reality has caused problems since the early DOS/MAC days.
UTF-8 and UTF-16 are encoding schemes. Whereas ISO-8859-1, Windows-1252 and Mac OS Roman have character code-points and character encodings that coincide (are the same), the UTF encoding schemes are simply character encoding schemes for Unicode.
Unicode?
Unicode defines characters in a mega table. The most commonly used characters have a code point that fits in 16 bits. Not surprisingly, the UTF-16 character encoding of those characters is the same as the Unicode code-point. UTF-8, on the other hand, is not, except in the ASCII range. Notice that both UTF-8 and UTF-16 use two bytes to encode ä but that the encoding is different.
Listing 2. später using a single code-point for ä
Name: s p ä t e r
----- ----- ----- ----- ----- -----
UTF-16: 00 73 00 70 00 E4 00 74 00 65 00 72
UTF-8: 73 70 C3 A4 74 65 72
ISO-8859-1: 73 70 E4 74 65 72
Windows-1252: 73 70 E4 74 65 72
Mac OS Roman: 73 70 8A 74 65 72
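The difference between a code point and its encodings can be checked with a short Python sketch. It assumes Python's built-in codecs; `utf-16-be` is used because it is big-endian UTF-16 without a byte order mark, which matches the byte order of the listings. The last lines illustrate, as one example of the DOS/Mac trouble, what happens when a Mac OS Roman byte is read as Windows-1252.

```python
word = "später"

# The Unicode code point of ä is just a number, independent of any encoding.
code_point = ord("ä")
print(f"code point: U+{code_point:04X}")

# The same word as a bit pattern under each encoding scheme or code table.
for codec in ["utf-16-be", "utf-8", "latin-1", "cp1252", "mac_roman"]:
    print(f"{codec:10} {word.encode(codec).hex(' ').upper()}")

# Decoding bytes with the wrong table silently yields the wrong character.
mac_bytes = "ä".encode("mac_roman")
misread = mac_bytes.decode("cp1252")
print(f"Mac byte {mac_bytes.hex().upper()} read as Windows-1252: {misread}")
```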
In addition to character code-points with “accents”, Unicode also defines diacritical marks. Some diacritics are combining and some are not.
Diacritic what?
Diacritics are “accents” which are defined separately from the actual character. When a character is followed by a combining diacritical mark, the two are combined to form a single character. (In Unicode, the combining mark always comes after the character it modifies.) In the case of ä, the character ‘a’ plus the diacritical combining mark ‘¨’ are combined by the rendering engine (e.g., the browser) into a single character ‘ä’. (See also, glyph).
Not all diacritics are “accents” nor are they combining. The apostrophe is considered a non-combining diacritical mark and is not an “accent”. For example, the apostrophe in “I have Carol’s book” is used to denote possession rather than pronunciation.
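The 8-bit code tables do not have a code-point for the combining ‘¨’ and cannot represent it. MySQL’s long-standing default character set when creating tables is latin1, its name for an ISO-8859-1-like Western European table (MySQL 8.0 and later default to utf8mb4 instead). You can change the defaults, but on those older versions it’s Latin-1 out of the box. Most XML I deal with is UTF-8 encoded.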
If you have a really clever conversion routine, you can convert später from UTF-8 into ISO-8859-1. I don’t recommend it. Eventually, you will run into characters that simply cannot be mapped into the 8-bit code space of ISO-8859-1. How prevalent is this problem? Consider the musicians Sinéad O’Connor, Björk, and 松居和.
Notice, again, the difference between UTF-8 and UTF-16 character encodings.
Listing 3. später using a combining diacritical mark, a + ¨ --> ä
Name: s p ä t e r
----- ----- ------------- ----- ----- -----
UTF-16: 00 73 00 70 00 61 03 08 00 74 00 65 00 72
UTF-8: 73 70 61 CC 88 74 65 72
ISO-8859-1: 73 70 61 ? 74 65 72
Windows-1252: 73 70 61 ? 74 65 72
Mac OS Roman: 73 70 61 ? 74 65 72
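The “really clever conversion routine” mentioned above can be approximated with Unicode normalization. This is only a sketch, assuming Python's standard `unicodedata` module: NFC composes a base character and a combining mark into a single precomposed character where one exists, after which the 8-bit encoding succeeds. It does nothing for characters that have no slot in the target table.

```python
import unicodedata

decomposed = "spa\u0308ter"   # 'a' followed by U+0308 COMBINING DIAERESIS
precomposed = "sp\u00e4ter"   # single code point U+00E4

# They render the same but are different code point sequences.
print(decomposed == precomposed)
print(len(decomposed), len(precomposed))

# ISO-8859-1 has no slot for the combining mark, so direct encoding fails.
try:
    decomposed.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False
print("direct encode worked:", direct_ok)

# NFC composes base + mark into the precomposed character where one exists.
composed = unicodedata.normalize("NFC", decomposed)
latin1_bytes = composed.encode("latin-1")
print(latin1_bytes.hex(" ").upper())
```

Normalizing in the other direction (NFD) splits ä back into ‘a’ plus the combining mark, so it is worth agreeing on one form before comparing or storing strings.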
There are other 8-bit code tables for many of the phonetic alphabets. The English word ‘later’ translated into Russian is ‘позже’. Each 8-bit code table is tied to a specific alphabet.
List of ISO-8859 code tables.
ISO-8859-1 Latin-1, Western European
ISO-8859-2 Latin-2, Central European
ISO-8859-3 Latin-3, South European
ISO-8859-4 Latin-4, North European
ISO-8859-5 Latin/Cyrillic
ISO-8859-6 Latin/Arabic
ISO-8859-7 Latin/Greek
ISO-8859-8 Latin/Hebrew
ISO-8859-9 Latin-5, Turkish
ISO-8859-10 Latin-6, Nordic
ISO-8859-11 Latin/Thai
ISO-8859-12 -- abandoned --
ISO-8859-13 Latin-7, Baltic Rim
ISO-8859-14 Latin-8, Celtic
ISO-8859-15 Latin-9
ISO-8859-16 Latin-10, South-Eastern European
In each of these ISO-8859 tables, the first 128 characters are identical with 7-bit ASCII.
There are separate code tables for ISO-8859, Windows and Mac since ‘позже’ cannot be represented in their Latin-1 equivalents.
Listing 4. позже
Name: п о з ж е
----- ----- ----- ----- -----
UTF-16: 04 3F 04 3E 04 37 04 36 04 35
UTF-8: D0 BF D0 BE D0 B7 D0 B6 D0 B5
ISO-8859-5: DF DE D7 D6 D5
Windows-1251: EF EE E7 E6 E5
Mac-Cyrillic: EF EE E7 E6 E5
How, then, can you encode a document using one of the ISO-8859 tables that contains the three words ‘later’, ‘später’ and ‘позже’?
You can’t. Not with the 8-bit tables. But Unicode is different. It is a unifying code table with more than 100,000 characters.
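To see this concretely, here is a small Python sketch. The helper `try_encode` is a name made up for this example, and the codec names are Python's. It tries to encode the three words as one document under several tables.

```python
text = "later später позже"

def try_encode(s, codec):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None

# Two Latin-1/Cyrillic 8-bit families versus two Unicode encoding schemes.
results = {
    codec: try_encode(text, codec)
    for codec in ["latin-1", "iso8859_5", "cp1251", "utf-8", "utf-16-be"]
}

for codec, raw in results.items():
    status = "cannot encode" if raw is None else f"{len(raw)} bytes"
    print(f"{codec:10} {status}")
```

The Latin-1 table trips over the Cyrillic letters, the Cyrillic tables trip over ä, and only the Unicode encodings take the whole string.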
The 8-bit tables combine both the code-point and the character encoding. That is, the code point for the letter ‘a’ in ISO-8859-1 is 0x61 and the character encoding for ‘a’–the bit pattern used to represent ‘a’ in the computer–is also 0x61.
Unicode is not a character encoding. It is strictly a character code table. Therefore, Unicode needs an encoding scheme to represent the character as a bit pattern. Since the code table is not tied to any character encoding, any suitable encoding scheme may be used. Many have emerged, most notably UTF-8 and UTF-16.
Eventually, you run into a scenario where none of the ISO-8859 tables will work for you. Consider Devanagari, the script used in Hindi. The ISO-8859 table for Devanagari was abandoned and India uses its own table, ISCII. (Like ASCII but with an ‘I’). The Hindi word for later is ‘बाद में’. Devanagari makes liberal use of diacritical marks.
Listing 5. बाद में
Name: ब ा द में
-------- -------- -------- ----- ----------------------------
UTF-16: 09 2C 09 3E 09 26 00 20 09 2E 09 47 09 02
UTF-8: E0 A4 AC E0 A4 BE E0 A4 A6 20 E0 A4 AE E0 A5 87 E0 A4 82
ISCII: DA EA D4 20 DC F1 B2
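The following Python sketch walks through the code points of the Hindi word and flags the combining marks, using the standard `unicodedata` module. The word is spelled with escapes so the exact code points are unambiguous. ISCII is left out because, as far as I know, Python's standard library ships no codec for it.

```python
import unicodedata

# ब ा द (space) म े ं, written as explicit code points
word = "\u092c\u093e\u0926 \u092e\u0947\u0902"

for ch in word:
    category = unicodedata.category(ch)
    # General categories starting with "M" are marks (diacritics).
    kind = "mark" if category.startswith("M") else "base/other"
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    print(f"U+{ord(ch):04X}  {category}  {kind:10}  {utf8_hex}")

marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]
print(f"{len(marks)} of {len(word)} code points are marks")
```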
By now, you get the point. If you must deal with various character sets, Unicode is your friend. That’s great. Wow! All those phonetic alphabets in one table. “I bet this works great for Chinese characters”, you say? It does, but it has yet to catch on in the Chinese-speaking world.
Chinese characters come in two flavors: Traditional and Simplified. The English word ‘later’ is 稍後 in traditional characters and 稍后 in simplified characters. It is the same word, the same pronunciation but a different rendering.
Chinese-speaking countries have adopted either Traditional or Simplified characters as their national standard. Many characters are visually the same in both standards (e.g., 稍) while others are not (e.g., 後 and 后). And there are numerous character encoding schemes that exist solely for the Chinese character sets. Two standards emerge as the leaders.
Traditional character countries gravitate towards Big5 while Simplified character countries lean towards GB2312. These are double-byte code tables: each Chinese character occupies a fixed 16 bits.
Notice, however, that the listings below give Big5 and GB2312 byte sequences for both the traditional and the simplified spelling.
Listing 6. 稍後
Name: 稍 後
-------- --------
UTF-16: 7A 0D 5F 8C
UTF-8: E7 A8 8D E5 BE 8C
Big5: B5 78 AB E1
GB2312: C9 D4 E1 E1
Listing 7. 稍后
Name: 稍 后
-------- --------
UTF-16: 7A 0D 54 0E
UTF-8: E7 A8 8D E5 90 8E
Big5: B5 78 A6 59
GB2312: C9 D4 BA F2
You will find that the UTF-8 character encoding for Chinese characters generally requires three bytes (or more) whereas Big5 and GB2312 require only two. This becomes a serious issue for web application internationalization (i18n). If you create MySQL tables to use UTF-8, you will need to translate any differing character encoding into UTF-8. This isn’t so much a problem with XML (as it is usually UTF-8, in my experience) or with your own website (as you have control over the front end).
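A rough Python sketch of both points: the size difference, and the translation step into UTF-8. The function `to_utf8` is a hypothetical helper written for this example, and it assumes you already know the encoding of the incoming bytes, which in practice is the hard part.

```python
word = "\u7a0d\u540e"  # 稍后, the simplified form

# Bytes needed for the same two characters under each encoding.
sizes = {
    codec: len(word.encode(codec))
    for codec in ["utf-8", "utf-16-be", "big5", "gb2312"]
}
for codec, size in sizes.items():
    print(f"{codec:10} {size} bytes")

def to_utf8(raw, source_codec):
    """Transcode bytes from a known source encoding into UTF-8."""
    return raw.decode(source_codec).encode("utf-8")

# Simulate input arriving as GB2312 and being stored as UTF-8.
incoming = word.encode("gb2312")
stored = to_utf8(incoming, "gb2312")
print(incoming.hex(" ").upper(), "->", stored.hex(" ").upper())
```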
Lastly, Unicode is not a 16-bit table. It is not tied to any particular bit width. When more space is needed, the Unicode folks add more table space. For example, the Unicode code points for Tai Xuan Jing Symbols (under Miscellaneous Symbols and Dingbats) define a code-point for the symbol for eternity at U+1D33A, which lies beyond the 16-bit range. The system fonts for Macintosh include a rendering for this symbol. I don’t know if they exist on other platforms so I include a picture of it.
Listing 8. (TETRAGRAM FOR ETERNITY)
UTF-16: D8 34 DF 3A
UTF-8: F0 9D 8C BA
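The four UTF-16 bytes in Listing 8 are a surrogate pair, the mechanism UTF-16 uses for code points above U+FFFF. The sketch below computes the pair by hand and compares it with Python's own encoder. The function name `to_surrogates` is made up for this example.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

cp = 0x1D33A
high, low = to_surrogates(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against the built-in codecs.
print(chr(cp).encode("utf-16-be").hex(" ").upper())
print(chr(cp).encode("utf-8").hex(" ").upper())
```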
And there you have it. Character codes and encoding. But this isn’t the end of the story. To complete the round trip from web page to database to web page, you will need to contend with URL encoding and configuring your webapp. But that will need to wait for another day.