unicode – Red Leopard

Unicode Backlash

kelly — Fri, 28 Aug 2009 20:12:08 +0000

Yesterday, I stumbled upon a blog that made me laugh. Truth be told, I have a snarky side (my evil twin, if you will). I keep it in check. Mostly. Some of you will understand. Some of you won’t. It’s a Gemini thing.

My twin is fun but good luck putting it back in its cage. And believe me, for every dollar of fun you get, you’ll pay ten dollars in social disaster. Best to keep the twin in check.

But yesterday, I stumbled upon Ted Dziuba’s blog and, holy smokes, my twin was rattling its cage! Ted’s rant on unicode is spot on. I’ll further say, unicode is super-important only to people for whom unicode is super-important. If your backend services only understand ASCII then unicode is anti-important.

For example, YouTube is happy with 无法停下 as are most of the online music services. But that costs money, believe it. If you can make do with ASCII, you’ll save money and a lot of headaches.

Oh, how did I stumble upon Ted’s place? Saw his scalability article in the Hot Links list on highscalability.

My twin rip snorted.

Verdana Hates Pinyin

kelly — Sat, 08 Aug 2009 22:09:16 +0000

I stumbled across an article on lostlaowai.com

www.lostlaowai.com/survival-chinese

which led me to poke around the site a bit. At the above URL, I noticed that some of the combining diacritical marks (tone marks) used in writing pinyin were not rendering properly. I had not seen this problem before. It didn’t make sense.

Things that don’t make sense bug me. And being something of a character geek, I couldn’t let it go. So I tried to reproduce the problem in a test example. I couldn’t. That’s when I discovered a quirky Mac OS X copy+paste issue. I sensed there was a problem but the truth was elusive. You can’t see that copy+paste changes the string characters unless you look at a binary dump of the file (which I did).

Okay, the Mandarin word for ‘good’ is 好 and in pinyin is written ‘hǎo’. It’s possible to write the pinyin using codepoints from just the unicode Latin block.

Latin Extended-B (Latin)
latin small letter a with caron
Unicode  01CE
UTF-8    C7 8E

   h    ǎ    o
0068 01CE 006F

It’s also possible to write the pinyin using Combining Diacritical Marks.

Combining Diacritical Marks (Combining Marks)
combining caron
Unicode  030C
UTF-8    CC 8C

   h    a 030C    o
0068 0061 030C 006F
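
The two spellings are easy to compare from code. Here is a minimal sketch in Python (the language and the helper names `code_points` and `utf8_hex` are illustrative choices, not something from the original investigation) that prints the code points and UTF-8 bytes of each form:

```python
# Two ways to spell "hao" with a third-tone mark:
# a precomposed letter versus a base letter plus a combining caron.
precomposed = "h\u01ceo"   # h, LATIN SMALL LETTER A WITH CARON, o
decomposed = "ha\u030co"   # h, a, COMBINING CARON, o


def code_points(text):
    """Return the code points of text as four-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


print(code_points(precomposed), utf8_hex(precomposed))
print(code_points(decomposed), utf8_hex(decomposed))
# The strings are not equal, even though they should render alike.
print(precomposed == decomposed)
```

The inequality on the last line is the whole story: two strings that look the same on screen can differ byte for byte.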

Note that the combining mark comes after the character it decorates. This is in contrast to Mac OS X’s U.S. Extended Keyboard input method, where you type a modifier letter before the character to decorate. However, the modifier letter is not a combining mark. With it, you cannot create a byte sequence that a browser renders as hǎo; it will come out as hˇao.

Spacing Modifier Letters (Modifier Letters)
caron
Unicode  02C7
UTF-8    CB 87

   h 02C7    a    o
0068 02C7 0061 006F

NOTE: the caron does not combine with the a; OS X does not
modify the 'a' to have a caron above.
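
The difference between the two carons shows up in the Unicode character database. A small sketch using Python's standard `unicodedata` module (the `describe` helper is a made-up name for illustration):

```python
import unicodedata


def describe(ch):
    """Return (code point, name, general category, combining class) for ch."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


# Spacing modifier letter caron versus combining caron.
for ch in ("\u02c7", "\u030c"):
    print(describe(ch))
```

A combining class of 0 and a letter category mean the character stands on its own; a nonzero combining class and a mark category mean it attaches to the preceding base character.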

The OS X input method uses the modifier letter to look up an equivalent codepoint in unicode’s latin block.

Using OS X's US Extended Keyboard Input Method
opt-v + a

   h 02C7    a    o              h    ǎ    o
0068 02C7 0061 006F    ==>    0068 01CE 006F

Note: the caron combines with the a; OS X automatically
converts 02C7 + 0061 into 01CE.

To check the code points, I used this handy tool:

people.w3.org/rishida/scripts/uniview/conversion.php

open the OS X character palette
Go to the URL above
place the cursor in the upper left box labeled Characters
type the letter h into the box
type the letter a into the box
from character palette, insert character 030C into the box
type the letter o into the box
click the convert button just above the Characters box, the UTF-16 Code units box will have the sequence (in unicode code points) 0068 0061 030C 006F
select and copy (cmd+c) the contents of the Characters box
immediately paste contents back into the Characters box
click the convert button just above the Characters box, the UTF-16 Code units box now has the sequence 0068 01CE 006F

Aha! The copy and paste operation changed the string’s character code points! Imagine my surprise.
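
The before and after sequences match what Unicode calls canonical composition (normalization form NFC). Whether the OS X pasteboard applies exactly NFC is an assumption here, not something the experiment proves, but the same transformation can be reproduced in Python:

```python
import unicodedata

typed = "ha\u030co"                              # 0068 0061 030C 006F
pasted = unicodedata.normalize("NFC", typed)     # compose base + mark
restored = unicodedata.normalize("NFD", pasted)  # decompose again

for label, text in (("typed", typed), ("pasted", pasted), ("restored", restored)):
    print(label, " ".join(f"{ord(ch):04X}" for ch in text))
```

NFD goes the other way, so the round trip gets back to the sequence that was originally typed.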

That mystery solved, I next dove into the lostlaowai source code. This was my first encounter with using character entity encoding of the combining diacritical marks. Rather than type the characters directly into the source code, like this

hǎo

lostlaowai encoded the non-ascii characters like this

hǎo

even though the page encoding was declared as UTF-8.

Maybe it’s a joomla thing. lostlaowai uses joomla.
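
For reference, this is roughly what numeric character references look like when generated from code. The sketch below uses Python's built-in `xmlcharrefreplace` error handler; it is only an illustration of the notation and makes no claim about how joomla or lostlaowai actually produce their markup:

```python
import html

text = "ha\u030co"  # h, a, COMBINING CARON, o

# Replace every non-ASCII character with a decimal character reference.
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# A browser (or html.unescape) turns the reference back into the character.
print(html.unescape(escaped) == text)
```

The reference carries the same code point as the raw character, so a UTF-8 page gains nothing from it except ASCII-only source.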

After a quick bout of deleting blocks of source code, I isolated the culprit!



  
  wonderful.html



   好極了！
1. hǎo jíle!
2. hǎo jíle!

3. hǎo jíle!
4. hǎo jíle!

5. hǎo jíle!
6. hǎo jíle!

Source code: wonderful.html

Adding Verdana to the font family causes the problem. I searched to see if anyone else had seen this problem. Indeed. Wikipedia.org has an entry on a similar bug and fileformat.info lists the five marks supported by Verdana. That’s sad. Verdana only supports 5 of the 112 code points in unicode’s Combining Diacritical Marks block.

The Verdana typeface, released in 1996, was created for and is owned by Microsoft. If Microsoft hasn’t fixed Verdana after more than a decade, I’ll assume they never will, and prudence suggests avoiding it.

At least avoid using Verdana in writing pinyin using combining diacritical marks. If you must use Verdana, then use codepoints from unicode’s latin block. On the Mac, this is the default when typing these characters in directly using the U.S. Extended keyboard.

Character     ā    á    ǎ    à
Unicode    0101 00E1 01CE 00E0
------------------------------
Character     ē    é    ě    è
Unicode    0113 00E9 011B 00E8
------------------------------
Character     ī    í    ǐ    ì
Unicode    012B 00ED 01D0 00EC
------------------------------
Character     ō    ó    ǒ    ò
Unicode    014D 00F3 01D2 00F2
------------------------------
Character     ū    ú    ǔ    ù
Unicode    016B 00FA 01D4 00F9
------------------------------
Character     ǖ    ǘ    ǚ    ǜ
Unicode    01D6 01D8 01DA 01DC
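
A chart like this does not have to be typed by hand. As a sketch, the precomposed code points can be derived by composing each vowel with each tone mark and letting NFC normalization pick the single-character form (the names `TONE_MARKS`, `VOWELS` and `toned` are illustrative):

```python
import unicodedata

# Combining macron, acute, caron, grave: pinyin tones 1 through 4.
TONE_MARKS = ["\u0304", "\u0301", "\u030c", "\u0300"]
VOWELS = ["a", "e", "i", "o", "u", "\u00fc"]  # the last one is u with diaeresis


def toned(vowel):
    """Return the four precomposed tone variants of a pinyin vowel."""
    return [unicodedata.normalize("NFC", vowel + mark) for mark in TONE_MARKS]


for vowel in VOWELS:
    print("   ".join(f"{ch} {ord(ch):04X}" for ch in toned(vowel)))
```

Every combination here has a precomposed form, so each result is exactly one code point.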

If you have to convert an existing web page (like the lostlaowai page mentioned above), you could take advantage of the copy+paste quirk in OS X. Simply open the web page, copy the pinyin and paste it into a text editor (e.g., back into the source). The original text is not rendered properly but that’s ok. The character codes are correct. When you paste it into the editor, OS X will convert the char+mark into a single char from the latin code block.
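
If a Mac is not handy, the same conversion can be scripted. This is a rough sketch, assuming the pinyin has already been pulled out of the markup as plain text (running `html.unescape` over a whole HTML file would also unescape things like `&lt;` and break the page); `to_precomposed` is a hypothetical helper name:

```python
import html
import unicodedata


def to_precomposed(text):
    """Decode character references, then compose char+mark pairs (NFC)."""
    return unicodedata.normalize("NFC", html.unescape(text))


# Sample input written with decimal references for the combining marks.
sample = "ha&#780;o ji&#769;le!"
converted = to_precomposed(sample)
print(" ".join(f"{ord(ch):04X}" for ch in converted))
```

The output contains 01CE and 00ED in place of the letter plus combining mark pairs, which is the form Verdana can display.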

Finally, the character ‘a’ in pinyin is sometimes written using the unicode codepoint 0251 ‘ɑ’, which is still in the latin block but in the section called “IPA Extensions”. It has a different look from the standard ascii character ‘a’. There is no set of precomposed codepoints for ‘ɑ’ to replace the accented characters in the chart above.

Character codes and encoding

kelly — Thu, 11 Dec 2008 15:43:48 +0000

In the beginning, there was ASCII. (There were others, but we begin here with ASCII).

7-bit ASCII in an 8-bit package. Using only the first seven bits of a byte, standard ASCII could not deal with diacritical marks (accents and funny dots in the vulgar vernacular). Therefore, the German word for later, später, would sometimes be transliterated into spaeter. I suspect this started in the days before computers when typewriters had only so many keys. You had to make do.

The computer age began before most of us were born. And with it came a need to encode characters into bits. And since there exist more than the simple “unaccented” latin characters in the world, other standards emerged. However, most surviving systems use the basic 7-bit ASCII characters as their starting point.

Listing 1. später transliterated into spaeter

        Name:   s       p       a       e       t       e       r
              -----   -----   -----   -----   -----   -----   -----
      UTF-16: 00 73   00 70   00 61   00 65   00 74   00 65   00 72
       UTF-8:    73      70      61      65      74      65      72
  ISO-8859-1:    73      70      61      65      74      65      72
Windows-1252:    73      70      61      65      74      65      72
Mac OS Roman:    73      70      61      65      74      65      72
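
The make-do transliteration is simple enough to sketch in a few lines of Python. The mapping below covers only the German umlauts and ß, and both the table and the function name are illustrative assumptions rather than any standard:

```python
# ASCII stand-ins for the German letters that 7-bit ASCII lacks.
GERMAN_ASCII = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",   # ä ö ü
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",   # Ä Ö Ü
    "\u00df": "ss",                                   # ß
}


def transliterate(word):
    """Replace German umlauts and sharp s with ASCII digraphs."""
    return "".join(GERMAN_ASCII.get(ch, ch) for ch in word)


ascii_word = transliterate("sp\u00e4ter")
print(ascii_word, ascii_word.encode("ascii").hex())
```

The result encodes identically in every table in Listing 1 because it only uses characters from the 7-bit range.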

ISO-8859-1, Windows-1252 and Mac OS Roman are all 8-bit code tables. That is, every character defined in these standards is represented by 8 bits. These represent characters used by most of the western European languages.

In contrast, UTF-16 uses two bytes per character (four for code points above U+FFFF) and UTF-8 uses one to four bytes.

Basic ASCII uses only the lower 128 code-points in a byte (ergo, the reference to 7-bit ASCII). But there are 8 bits in a byte. That leaves the upper half of the code-points open to accommodate other characters.

ISO-8859-1 and Windows-1252 have the same encoding for ä but Mac OS Roman does not. This historical reality has caused problems since the early DOS/MAC days.
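
A quick way to see the kind of trouble this causes is to write a word with one table and read it back with another. A minimal sketch using Python's built-in codecs (`mac_roman`, `latin-1`, `cp1252`):

```python
word = "sp\u00e4ter"  # später

mac_bytes = word.encode("mac_roman")
latin_bytes = word.encode("latin-1")
print(mac_bytes.hex(), latin_bytes.hex())

# A Windows program that assumes Windows-1252 misreads the Mac bytes.
misread = mac_bytes.decode("cp1252")
print(misread)
```

Nothing fails and no error is raised; the reader simply sees the wrong letter, which is what makes this class of bug so persistent.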

UTF-8 and UTF-16 are encoding schemes. Whereas ISO-8859-1, Windows-1252 and Mac OS Roman have character code-points and character encodings that coincide (are the same), the UTF encoding schemes are simply character encoding schemes for Unicode.

Unicode?

Unicode defines characters in a mega table. The most commonly used characters have code points that fit in 16 bits. Not surprisingly, the UTF-16 encoding of such a character is the same as its Unicode code-point. UTF-8, on the other hand, is not. Notice in Listing 2 that both UTF-8 and UTF-16 use two bytes to encode ä but that the encoding is different.

Listing 2. später using a single code-point for ä

        Name:   s       p        ä       t       e       r
              -----   -----   -----   -----   -----   -----
      UTF-16: 00 73   00 70   00 E4   00 74   00 65   00 72
       UTF-8:    73      70   C3 A4      74      65      72
  ISO-8859-1:    73      70      E4      74      65      72
Windows-1252:    73      70      E4      74      65      72
Mac OS Roman:    73      70      8A      74      65      72
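
A listing like this can be regenerated rather than looked up. The sketch below leans on Python's built-in codecs; `hex_bytes` is an illustrative helper, and `utf-16-be` is used so that no byte order mark gets in the way:

```python
def hex_bytes(text, encoding):
    """Encode text and return the bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


word = "sp\u00e4ter"  # später with the single code point U+00E4
for encoding in ("utf-16-be", "utf-8", "iso-8859-1", "cp1252", "mac_roman"):
    print(f"{encoding:>10}: {hex_bytes(word, encoding)}")
```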

In addition to character code-points with “accents”, Unicode also defines diacritical marks. Some diacritics are combining and some are not.

Diacritic what?

Diacritics are “accents” which are defined separately from the actual character. When a character is followed by a combining diacritical mark, the two are combined to form a single character. In the case of ä, the character ‘a’ plus the combining diacritical mark ‘¨’ are combined by the rendering engine (e.g., the browser) into a single character ‘ä’. (See also, glyph).

Not all diacritics are “accents” nor are they combining. The apostrophe is considered a non-combining diacritical mark and is not an “accent”. For example, the apostrophe in “I have Carol’s book” is used to denote possession rather than pronunciation.

The 8-bit code tables do not have a code-point for the combining ‘¨’ (U+0308) and cannot represent it. MySQL’s default character encoding when creating tables is ISO-8859-1. You can change the defaults but out of the box, it’s ISO-8859-1. Most XML I deal with is UTF-8 encoded.

If you have a really clever conversion routine, you can convert später from UTF-8 into ISO-8859-1. I don’t recommend it. Eventually, you will run into characters that simply cannot be mapped into the 8-bit code space of ISO-8859-1. How prevalent is this problem? Consider the musicians Sinéad O’Connor, Björk, and 松居和.
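
To make the failure concrete, here is a small sketch that pushes the three names through ISO-8859-1 with Python's lossy `errors="replace"` mode. The function name is made up for illustration, and a plain ASCII apostrophe is used in O'Connor on purpose, since the curly one is not in ISO-8859-1 either:

```python
def lossy_latin1(text):
    """Encode to ISO-8859-1, substituting '?' for unmappable characters."""
    return text.encode("iso-8859-1", errors="replace")


names = ["Sin\u00e9ad O'Connor", "Bj\u00f6rk", "松居和"]
for name in names:
    encoded = lossy_latin1(name)
    survived = encoded.decode("iso-8859-1") == name
    print(encoded, "ok" if survived else "data lost")
```

The first two names survive the trip. The third comes back as question marks, and no amount of cleverness on the decoding side can recover it.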

Notice, again, the difference between UTF-8 and UTF-16 character encodings.

Listing 3. später using a combining diacritical mark, a + ¨ --> ä

        Name:   s       p           ä           t       e       r
              -----   -----   -------------   -----   -----   -----
      UTF-16: 00 73   00 70   00 61   03 08   00 74   00 65   00 72
       UTF-8:    73      70      61   CC 88      74      65      72
  ISO-8859-1:    73      70      61     ?        74      65      72
Windows-1252:    73      70      61     ?        74      65      72
Mac OS Roman:    73      70      61     ?        74      65      72
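
The question marks in Listing 3 correspond to an encoding error in practice. One plausible version of a "clever conversion routine" is to compose the letter and mark first (NFC) and only then encode. A sketch, with `to_latin1` as an illustrative name:

```python
import unicodedata


def to_latin1(text):
    """Compose char+mark pairs where possible, then encode to ISO-8859-1."""
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")


decomposed = "spa\u0308ter"  # a followed by COMBINING DIAERESIS

try:
    decomposed.encode("iso-8859-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False  # U+0308 has no slot in the 8-bit table

print(direct_ok, to_latin1(decomposed).hex())
```

This only helps when a precomposed equivalent exists in the target table, so it postpones the problem rather than solving it.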

There are other 8-bit code tables for many of the phonetic alphabets. The English word ‘later’ translated into Russian is ‘позже’. Each 8-bit code table is tied to a specific alphabet.

List of ISO-8859 code tables.

ISO-8859-1  Latin-1, Western European
ISO-8859-2  Latin-2, Central European
ISO-8859-3  Latin-3, South European
ISO-8859-4  Latin-4, North European
ISO-8859-5  Latin/Cyrillic
ISO-8859-6  Latin/Arabic
ISO-8859-7  Latin/Greek
ISO-8859-8  Latin/Hebrew
ISO-8859-9  Latin-5, Turkish
ISO-8859-10 Latin-6, Nordic
ISO-8859-11 Latin/Thai
ISO-8859-12 -- abandoned --
ISO-8859-13 Latin-7, Baltic Rim
ISO-8859-14 Latin-8, Celtic
ISO-8859-15 Latin-9
ISO-8859-16 Latin-10, South-Eastern European

In each of these ISO-8859 tables, the first 128 characters are identical with 7-bit ASCII.

There are separate code tables for ISO-8859, Windows and Mac since ‘позже’ cannot be represented in their Latin-1 equivalents.

Listing 4. позже

        Name:   п       о       з       ж       е
              -----   -----   -----   -----   -----
      UTF-16: 04 3F   04 3E   04 37   04 36   04 35
       UTF-8: D0 BF   D0 BE   D0 B7   D0 B6   D0 B5
  ISO-8859-5:    DF      DE      D7      D6      D5
Windows-1251:    EF      EE      E7      E6      E5
Mac-Cyrillic:    EF      EE      E7      E6      E5
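
The Cyrillic rows can be reproduced the same way. A sketch using the codec names Python ships with (`iso8859_5`, `cp1251`, `mac_cyrillic`); the `hex_bytes` helper is illustrative:

```python
def hex_bytes(text, encoding):
    """Encode text and return the bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


word = "\u043f\u043e\u0437\u0436\u0435"  # позже
for encoding in ("utf-16-be", "utf-8", "iso8859_5", "cp1251", "mac_cyrillic"):
    print(f"{encoding:>12}: {hex_bytes(word, encoding)}")
```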

How, then, can you encode a document using one of the ISO-8859 tables that contains the three words ‘later’, ‘später’ and ‘позже’?

You can’t. Not with the 8-bit tables. But Unicode is different. It is a unifying code table with more than 100,000 characters.
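
Here is that claim as a quick experiment: one string holding all three words, offered to several encodings in turn. `can_encode` is a hypothetical helper for this sketch:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


document = "later sp\u00e4ter \u043f\u043e\u0437\u0436\u0435"  # later später позже
for encoding in ("iso-8859-1", "iso8859_5", "utf-8", "utf-16"):
    print(f"{encoding:>10}: {can_encode(document, encoding)}")
```

The Latin-1 table trips over the Cyrillic and the Cyrillic table trips over the ä, while both Unicode encodings take the whole string.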

The 8-bit tables combine both the code-point and the character encoding. That is, the code point for the letter ‘a’ in ISO-8859-1 is 0x61 and the character encoding for ‘a’–the bit pattern used to represent ‘a’ in the computer–is also 0x61.

Unicode is not a character encoding. It is strictly a character code table. Therefore, Unicode needs an encoding scheme to represent the character as a bit pattern. Since the code table is not tied to any character encoding, any suitable encoding scheme may be used. Many have emerged, most notably UTF-8 and UTF-16.
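
To show that an encoding scheme is just arithmetic on the code point, here is a hand-rolled UTF-8 encoder. It is a teaching sketch, not a replacement for `str.encode`: it assumes a valid code point and does no range or surrogate checking.

```python
def utf8_encode(code_point):
    """Turn one Unicode code point into its UTF-8 byte sequence."""
    if code_point < 0x80:        # 7 bits: one byte, same as ASCII
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x61, 0xE4, 0x7A0D, 0x1D33A):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

The code point stays the same in every case; only the bit pattern chosen to carry it changes.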

Eventually, you run into a scenario where none of the ISO-8859 tables will work for you. Consider Devanagari, the script used in Hindi. The ISO-8859 table for Devanagari was abandoned and India uses its own table, ISCII. (Like ASCII but with an ‘I’). The Hindi word for later is ‘बाद में’. Devanagari makes liberal use of diacritical marks.

Listing 5. बाद में 
Name:      ब         ा          द                          में 
        --------  --------  --------  -----  ----------------------------
UTF-16:    09 2C     09 3E     09 26  00 20     09 2E     09 47     09 02
 UTF-8: E0 A4 AC  E0 A4 BE  E0 A4 A6     20  E0 A4 AE  E0 A5 87  E0 A4 82
 ISCII:       CA        DA        C4     20        CC        E1        A2
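
How liberal? A sketch that asks the Unicode character database about each code point in the word. Categories starting with `M` are marks (`Mc` spacing, `Mn` nonspacing) that attach to the preceding consonant:

```python
import unicodedata

word = "\u092c\u093e\u0926 \u092e\u0947\u0902"  # बाद में

categories = [unicodedata.category(ch) for ch in word]
for ch, category in zip(word, categories):
    print(f"U+{ord(ch):04X}  {category}  {unicodedata.name(ch)}")

mark_count = sum(1 for category in categories if category.startswith("M"))
print(mark_count, "of", len(word), "code points are marks")
```

Python has no built-in ISCII codec as far as I know, so this sketch sticks to the Unicode side of the listing.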

By now, you get the point. If you must deal with various character sets, Unicode is your friend. That’s great. Wow! All those phonetic alphabets in one table. “I bet this works great for Chinese characters”, you say? It does but it has yet to catch on in the Chinese speaking world.

Chinese characters come in two flavors: Traditional and Simplified. The English word ‘later’ is 稍後 in traditional characters and 稍后 in simplified characters. It is the same word, the same pronunciation but a different rendering.

Chinese speaking countries have adopted either Traditional or Simplified characters as their national standard. Many characters are visually the same in both standards (e.g., 稍) while others are not (e.g., 後 and 后). And there are numerous character encoding schemes that exist solely for the Chinese character sets. Two standards emerge as the leaders.

Traditional character countries gravitate towards Big5 while Simplified character countries lean towards GB2312. These are fixed width, 16-bit code tables.

Notice, however, that both Big5 and GB2312 have code points for and character encodings for both traditional and simplified characters.

Listing 6. 稍後

  Name:    稍          後
        --------   --------
UTF-16:    7A 0D      5F 8C
 UTF-8: E7 A8 8D   E5 BE 8C
  Big5:    B5 78      AB E1
GB2312:    C9 D4      E1 E1

Listing 7. 稍后

  Name:    稍          后
        --------   --------
UTF-16:    7A 0D      54 0E
 UTF-8: E7 A8 8D   E5 90 8E
  Big5:    B5 78      A6 59
GB2312:    C9 D4      BA F3
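
Listings 6 and 7 can be regenerated with Python's `big5` and `gb2312` codecs. This is a sketch: `encode_hex` is an illustrative helper that returns `None` instead of raising when a codec lacks a character, since codec coverage can differ from the printed standards.

```python
def encode_hex(text, encoding):
    """Return the encoded bytes as hex, or None if the text does not fit."""
    try:
        return " ".join(f"{b:02X}" for b in text.encode(encoding))
    except UnicodeEncodeError:
        return None


for word in ("稍後", "稍后"):
    print(word)
    for encoding in ("utf-16-be", "utf-8", "big5", "gb2312"):
        print(f"  {encoding:>9}: {encode_hex(word, encoding)}")
```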

You will find that the UTF-8 character encoding for Chinese characters generally requires three bytes (or more) whereas Big5 and GB2312 require only two. This becomes a serious issue for web application internationalization (i18n). If you create MySQL tables to use UTF-8, you will need to translate any differing character encoding into UTF-8. This isn’t so much a problem with XML (as it is usually UTF-8, in my experience) or with your own website (as you have control over the front end).
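
The translation step itself is short. A minimal sketch, assuming you already know the encoding of the incoming bytes (guessing it is a separate and harder problem); `to_utf8` is a hypothetical helper name:

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from a known encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


big5_bytes = "稍後".encode("big5")     # stand-in for data from a Big5 source
utf8_bytes = to_utf8(big5_bytes, "big5")

print(len(big5_bytes), "bytes in Big5 ->", len(utf8_bytes), "bytes in UTF-8")
```

The size difference is the trade-off described above: two bytes per character going in, three coming out.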

Lastly, Unicode is not a 16-bit table. It’s bit independent. When more space is needed, the Unicode folks will add more table space. For example, the Unicode code points for Tai Xuan Jing Symbols (under Miscellaneous Symbols and Dingbats) define a code-point for the symbol for eternity at U+1D33A. The system fonts for Macintosh include a rendering for this symbol. I don’t know if they exist on other platforms so I include a picture of it.

Listing 8. (TETRAGRAM FOR ETERNITY)

UTF-16: D8 34 DF 3A
 UTF-8: F0 9D 8C BA
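
The four UTF-16 bytes are a surrogate pair, which is how UTF-16 reaches code points beyond 16 bits. The arithmetic can be sketched as follows (`surrogate_pair` is an illustrative name):

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1D33A)
print(f"{high:04X} {low:04X}")
print(chr(0x1D33A).encode("utf-16-be").hex(" "))
```

Both lines print the same four bytes, matching the UTF-16 row of Listing 8.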

And there you have it. Character codes and encoding. But this isn’t the end of the story. To complete the round trip from web page to database to web page, you will need to contend with URL encoding and configuring your webapp. But that will need to wait for another day.