<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Red Leopard &#187; unicode</title>
	<atom:link href="http://www.redleopard.com/tag/unicode/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.redleopard.com</link>
	<description>A Stranger in a Strange Land</description>
	<lastBuildDate>Tue, 20 Dec 2011 14:55:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Unicode Backlash</title>
		<link>http://www.redleopard.com/2009/08/unicode-backlash/</link>
		<comments>http://www.redleopard.com/2009/08/unicode-backlash/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 20:12:08 +0000</pubDate>
		<dc:creator>kelly</dc:creator>
				<category><![CDATA[KellyBlog]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://www.redleopard.com/?p=798</guid>
		<description><![CDATA[Yesterday, I stumbled upon a blog that made me laugh. Truth be told, I have a snarky side, my evil twin, if you will. I keep it in check. Mostly. Some of you will understand. Some of you won&#8217;t. It&#8217;s a Gemini thing. My twin is fun but good luck putting it back in [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday, I stumbled upon a blog that made me laugh. Truth be told, I have a snarky side, my evil twin, if you will. I keep it in check. Mostly. Some of you will understand. Some of you won&#8217;t. It&#8217;s a Gemini thing.</p>
<p>My twin is fun but good luck putting it back in its cage. And believe me, for every dollar of fun you get, you&#8217;ll pay ten dollars in social disaster. Best to keep the twin in check.</p>
<p>But yesterday, I stumbled upon Ted Dziuba&#8217;s blog and, holy smokes, my twin was rattling its cage! Ted&#8217;s rant on <a href="http://teddziuba.com/2009/07/this-is-america-take-your-unic.html">unicode</a> is spot on. I&#8217;ll further say, unicode is super-important only to people for whom unicode is super-important. If your backend services only understand ASCII then unicode is anti-important.</p>
<p>For example, <a href="http://www.youtube.com/results?search_query=%E6%97%A0%E6%B3%95%E5%81%9C%E4%B8%8B">YouTube</a> is happy with 无法停下 as are most of the online music services. But that costs money, believe it. If you can make do with ASCII, you&#8217;ll save money and a lot of headache.</p>
<p>Oh, how did I stumble upon Ted&#8217;s place? Saw his <a href="http://teddziuba.com/2008/04/im-going-to-scale-my-foot-up-y.html">scalability article</a> in the <a href="http://highscalability.com/hot-links-2009-8-26">Hot Links</a> list on highscalability.</p>
<p>My twin rip snorted.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.redleopard.com/2009/08/unicode-backlash/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Verdana Hates Pinyin</title>
		<link>http://www.redleopard.com/2009/08/verdana-hates-pinyin/</link>
		<comments>http://www.redleopard.com/2009/08/verdana-hates-pinyin/#comments</comments>
		<pubDate>Sat, 08 Aug 2009 22:09:16 +0000</pubDate>
		<dc:creator>kelly</dc:creator>
				<category><![CDATA[KellyBlog]]></category>
		<category><![CDATA[mandarin]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://www.redleopard.com/?p=667</guid>
		<description><![CDATA[I stumbled across an article on lostlaowai.com www.lostlaowai.com/survival-chinese which led me to poke around the site a bit. At the above URL, I noticed that some of the combining diacritical marks (tone marks) used in writing pinyin were not rendering properly. I had not seen this problem before. It didn&#8217;t make sense. Things that don&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>I stumbled across an article on lostlaowai.com</p>
<p style="padding-left: 2em;"><a href="http://www.lostlaowai.com/survival-chinese">www.lostlaowai.com/survival-chinese</a></p>
<p>which led me to poke around the site a bit. At the above URL, I noticed that some of the combining diacritical marks (tone marks) used in writing pinyin were not rendering properly. I had not seen this problem before. It didn&#8217;t make sense.</p>
<p>Things that don&#8217;t make sense bug me. And being something of a character geek, I couldn&#8217;t let it go. So I tried to reproduce the problem in a test example. I couldn&#8217;t. That&#8217;s when I discovered a quirky Mac OS X copy+paste issue. I sensed there was a problem but the truth was elusive. You can&#8217;t see that copy+paste changes the string characters unless you look at a binary dump of the file (which I did).</p>
<p>Okay, the Mandarin word for &#8216;good&#8217; is 好 and in <a href="http://en.wikipedia.org/wiki/Pinyin">pinyin</a> is written &#8216;hǎo&#8217;. It&#8217;s possible to write the pinyin using codepoints from just the unicode Latin block.</p>
<div class="terminal">
<pre>
Latin Extended-B (Latin)
latin small letter a with caron
Unicode  01CE
UTF-8    C7 8E

   h    ǎ    o
0068 01CE 006F
</pre>
</div>
<p>It&#8217;s also possible to write the pinyin using Combining Diacritical Marks.</p>
<div class="terminal">
<pre>
Combining Diacritical Marks (Combining Marks)
combining caron
Unicode  030C
UTF-8    CC 8C

   h    a 030C    o
0068 0061 030C 006F
</pre>
</div>
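<p>The two spellings can be compared side by side. This is a minimal Python sketch, not part of my original test; the helper name hex_bytes is made up for illustration.</p>
<div class="terminal">
<pre>
```python
# Two spellings of the pinyin for 好, built from the code points listed above.
precomposed = "h\u01CEo"   # h + LATIN SMALL LETTER A WITH CARON + o
combining = "ha\u030Co"    # h + a + COMBINING CARON + o


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)


for label, text in (("precomposed", precomposed), ("combining", combining)):
    code_points = " ".join(f"{ord(ch):04X}" for ch in text)
    print(label, "code points:", code_points)
    print(label, "UTF-8 bytes:", hex_bytes(text.encode("utf-8")))

# The strings look alike on screen but are different code point sequences.
print("equal as strings:", precomposed == combining)
```
</pre>
</div>
<p>The comparison at the end is the important part: a plain string comparison treats the two spellings as different text.</p>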
<p>Note that the combining mark comes after the character it decorates. This is in contrast to Mac OS X&#8217;s U.S. Extended Keyboard input method, which precedes the character to decorate with a modifier letter. However, the modifier letter is not a combining mark. With the modifier letter, you cannot create a byte sequence that a browser renders as hǎo; it will come out as hˇao.</p>
<div class="terminal">
<pre>
Spacing Modifier Letters (Modifier Letters)
caron
Unicode  02C7
UTF-8    CB 87

   h 02C7    a    o
0068 02C7 0061 006F

NOTE: the caron does not combine with the a; OS X does not
modify the 'a' to have a caron above.
</pre>
</div>
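<p>The difference between the two carons shows up in the Unicode character database. A small sketch using Python&#8217;s standard unicodedata module; the constant names are mine.</p>
<div class="terminal">
<pre>
```python
import unicodedata

COMBINING_CARON = "\u030C"   # Combining Diacritical Marks block
MODIFIER_CARON = "\u02C7"    # Spacing Modifier Letters block

for ch in (COMBINING_CARON, MODIFIER_CARON):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "category:", unicodedata.category(ch),
        "combining class:", unicodedata.combining(ch),
        "UTF-8:", ch.encode("utf-8").hex(" ").upper(),
    )

# NFC composes a + combining caron into one code point,
# but it leaves the spacing modifier letter where it is.
composed = unicodedata.normalize("NFC", "ha" + COMBINING_CARON + "o")
untouched = unicodedata.normalize("NFC", "h" + MODIFIER_CARON + "ao")
print(len(composed), len(untouched))
```
</pre>
</div>
<p>A combining class of zero means the character never attaches to a neighbor, which is why the modifier letter stays a separate glyph.</p>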
<p>The OS X input method uses the modifier letter to look up an equivalent codepoint in unicode&#8217;s latin block.</p>
<div class="terminal">
<pre>
Using OS X's US Extended Keyboard Input Method
opt-v + a

   h 02C7    a    o              h    ǎ    o
0068 02C7 0061 006F    ==>    0068 01CE 006F

Note: the caron combines with the a; OS X automatically
converts 02C7 + 0061 into 01CE.
</pre>
</div>
<p>To check the code points, I used this handy tool:</p>
<p>  <a href="http://people.w3.org/rishida/scripts/uniview/conversion.php">people.w3.org/rishida/scripts/uniview/conversion.php</a></p>
<ol>
<li>open the OS X character palette</li>
<li>go to the URL above</li>
<li>place the cursor in the upper left box labeled Characters</li>
<li>type the letter h into the box</li>
<li>type the letter a into the box</li>
<li>from the character palette, insert character 030C into the box</li>
<li>type the letter o into the box</li>
<li>click the convert button just above the Characters box; the UTF-16 Code units box will have the sequence (in unicode code points) 0068 0061 030C 006F</li>
<li>select and copy (cmd+c) the contents of the Characters box</li>
<li>immediately paste the contents back into the Characters box</li>
<li>click the convert button just above the Characters box; the UTF-16 Code units box now has the sequence 0068 01CE 006F</li>
</ol>
<p>Aha! The copy and paste operation changed the string&#8217;s character code points! Imagine my surprise.</p>
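<p>The change made by the paste has the same result as Unicode canonical composition (NFC). I have not confirmed that OS X runs NFC internally, so treat that as an assumption; the sketch below only shows that NFC reproduces the same before and after sequences.</p>
<div class="terminal">
<pre>
```python
import unicodedata

before = "ha\u030Co"                           # as typed: 0068 0061 030C 006F
after = unicodedata.normalize("NFC", before)   # compose base letter + mark


def code_points(text):
    """List the code points of a string as four-digit hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print("before:", code_points(before))
print("after: ", code_points(after))

# NFD goes the other way and splits the composed letter again.
print("round trip:", unicodedata.normalize("NFD", after) == before)
```
</pre>
</div>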
<p>That mystery solved, I next dove into the lostlaowai source code. This was my first encounter with using character entity encoding of the combining diacritical marks. Rather than type the characters directly into the source code, like this</p>
<div class="terminal">
<pre>hǎo</pre>
</div>
<p>lostlaowai encoded the non-ascii characters like this</p>
<div class="terminal">
<pre>ha&amp;#780;o</pre>
</div>
<p>even though the page encoding was declared as UTF-8</p>
<div class="terminal">
<pre>&lt;meta http-equiv="content-type" content="text/html; charset=utf-8" /&gt;</pre>
</div>
<p>Maybe it&#8217;s a joomla thing. lostlaowai uses joomla.</p>
<p>After a quick bout of deleting blocks of source code, I isolated the culprit!</p>
<div style="width: 592px;"><img width="224" height="305" alt="screenshot" style="float: right; margin: 0 0 0.5ex 0.5em;" src="/images/screenshots/wonderful.jpg" /> </div>
<div class="terminal">
<pre>
&lt;html&gt;
&lt;head&gt;
  &lt;meta
    http-equiv="content-type"
    content="text/html; charset=utf-8"&gt;
  &lt;title&gt;wonderful.html&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;

&lt;pre&gt;
   好極了！
1. ha&amp;#780;o ji&amp;#769;le!
2. hǎo jíle!
&lt;span style="font-family: Verdana,
   Arial, Helvetica, sans-serif;"&gt;
3. ha&amp;#780;o ji&amp;#769;le!
4. hǎo jíle!
&lt;/span&gt;&lt;span style="font-family:
   Arial, Helvetica, sans-serif;"&gt;
5. ha&amp;#780;o ji&amp;#769;le!
6. hǎo jíle!
&lt;/span&gt;&lt;/pre&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
</div>
<p>Source code: <a href="/code/wonderful.html">wonderful.html</a></p>
<p>Adding Verdana to the font family causes the problem. I <a href="http://www.google.com/search?hl=en&#038;num=100&#038;as_q=Verdana+Combining+Diacritical+Marks">searched</a> to see if anyone else had seen this problem. Indeed. Wikipedia.org has an entry on <a href="http://en.wikipedia.org/wiki/Verdana#Combining_characters_bug">a similar bug</a> and fileformat.info <a href="http://www.fileformat.info/info/unicode/font/verdana/blockview.htm?block=combining_diacritical_marks">lists the five</a> marks supported by Verdana. That&#8217;s sad. Verdana only supports 5 of the 112 code points in unicode&#8217;s Combining Diacritical Marks block.</p>
<p>The Verdana typeface, released in 1996, was created for and is owned by Microsoft. If Microsoft hasn&#8217;t fixed Verdana after more than a decade, I&#8217;ll assume they never will, and prudence suggests avoiding it.</p>
<p>At least avoid using Verdana in writing pinyin using combining diacritical marks. If you must use Verdana, then use codepoints from unicode&#8217;s latin block. On the Mac, this is the default when typing these characters in directly using the U.S. Extended keyboard.</p>
<div class="terminal">
<pre>
Character     ā    á    ǎ    à
Unicode    0101 00E1 01CE 00E0
------------------------------
Character     ē    é    ě    è
Unicode    0113 00E9 011B 00E8
------------------------------
Character     ī    í    ǐ    ì
Unicode    012B 00ED 01D0 00EC
------------------------------
Character     ō    ó    ǒ    ò
Unicode    014D 00F3 01D2 00F2
------------------------------
Character     ū    ú    ǔ    ù
Unicode    016B 00FA 01D4 00F9
------------------------------
Character     ǖ    ǘ    ǚ    ǜ
Unicode    01D6 01D8 01DA 01DC
</pre>
</div>
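<p>The code points in a chart like this can be regenerated rather than typed by hand. A sketch, assuming NFC composition of each vowel with the four tone marks (macron, acute, caron, grave); the names TONE_MARKS, VOWELS and tone_row are made up for illustration.</p>
<div class="terminal">
<pre>
```python
import unicodedata

# Combining marks for the four tones: macron, acute, caron, grave.
TONE_MARKS = ("\u0304", "\u0301", "\u030C", "\u0300")
VOWELS = ("a", "e", "i", "o", "u", "\u00FC")   # the last one is ü


def tone_row(vowel):
    """Return the four precomposed tone forms of one vowel."""
    return [unicodedata.normalize("NFC", vowel + mark) for mark in TONE_MARKS]


for vowel in VOWELS:
    row = tone_row(vowel)
    print("Character ", "    ".join(row))
    print("Unicode   ", " ".join(f"{ord(ch):04X}" for ch in row))
```
</pre>
</div>
<p>Every combination here composes to a single code point, which is what makes the ord() call in the last line safe.</p>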
<p>If you have to convert an existing web page (like the lostlaowai page mentioned above), you could take advantage of the copy+paste quirk in OS X. Simply open the web page, copy the pinyin and paste it into a text editor (e.g., back into the source). The original text is not rendered properly but that&#8217;s ok. The character codes are correct. When you paste it into the editor, OS X will convert the char+mark into a single char from the latin code block.</p>
<p>Finally, the character &#8216;a&#8217; in pinyin is sometimes written using the unicode codepoint 0251 &#8216;ɑ&#8217; which is still in the latin block but in the section called &#8220;IPA Extensions&#8221;. It has a different look from the standard ascii character &#8216;a&#8217;. For this character, there is no set of codepoints that replaces the accented characters in the chart above.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.redleopard.com/2009/08/verdana-hates-pinyin/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Character codes and encoding</title>
		<link>http://www.redleopard.com/2008/12/character-codes-and-encoding/</link>
		<comments>http://www.redleopard.com/2008/12/character-codes-and-encoding/#comments</comments>
		<pubDate>Thu, 11 Dec 2008 15:43:48 +0000</pubDate>
		<dc:creator>kelly</dc:creator>
				<category><![CDATA[KellyBlog]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://www.redleopard.com/?p=171</guid>
		<description><![CDATA[Character codes and encoding In the beginning, there was ASCII. (There were others, but we begin here with ASCII). 7-bit ASCII in an 8-bit package. Using only the first seven bits of a byte, standard ASCII could not deal with diacritical marks (accents and funny dots in the vulgar vernacular). Therefore, the German word for [...]]]></description>
			<content:encoded><![CDATA[<p>Character codes and encoding</p>
<p>In the beginning, there was ASCII. (There were others, but we begin here with ASCII).</p>
<p>7-bit ASCII in an 8-bit package. Using only the first seven bits of a byte, standard ASCII could not deal with diacritical marks (accents and funny dots in the vulgar vernacular). Therefore, the German word for later, später, would sometimes be transliterated into spaeter. I suspect this started in the days before computers when typewriters had only so many keys. You had to make do.</p>
<p>The computer age began before most of us were born. And with it came a need to encode characters into bits. And since the world holds more than the simple &#8220;unaccented&#8221; latin characters, other standards emerged. However, most surviving systems use the basic 7-bit ASCII characters as their starting point.</p>
<div class="terminal">
<pre>
Listing 1. später transliterated into spaeter

        Name:   s       p       a       e       t       e       r
              -----   -----   -----   -----   -----   -----   -----
      UTF-16: 00 73   00 70   00 61   00 65   00 74   00 65   00 72
       UTF-8:    73      70      61      65      74      65      72
  ISO-8859-1:    73      70      61      65      74      65      72
Windows-1252:    73      70      61      65      74      65      72
Mac OS Roman:    73      70      61      65      74      65      72
</pre>
</div>
<p>ISO-8859-1, Windows-1252 and Mac OS Roman are all 8-bit code tables. That is, every character defined in these standards is represented by 8 bits. These represent characters used by most of the western European languages.</p>
<p>In contrast, UTF-16 uses two bytes for most characters (four for code-points above U+FFFF) and UTF-8 uses one to four bytes.</p>
<p>Basic ASCII uses only the lower 128 code-points in a byte (ergo, the reference to 7-bit ASCII). But there are 8 bits in a byte. That leaves the upper half of the code-points open to accommodate other characters.</p>
<p>ISO-8859-1 and Windows-1252 have the same encoding for ä but Mac OS Roman does not. This historical reality has caused problems since the early DOS/MAC days.</p>
<p>UTF-8 and UTF-16 are encoding schemes. Whereas ISO-8859-1, Windows-1252 and Mac OS Roman have character code-points and character encodings that coincide (are the same), the UTF encoding schemes are simply character encoding schemes for Unicode.</p>
<p>Unicode?</p>
<p>Unicode defines characters in a mega table. Most characters in Unicode are defined by a 16-bit code point. Not surprisingly, UTF-16 character encoding is very often the same as the Unicode code-point. UTF-8, on the other hand, is not. Notice that both UTF-8 and UTF-16 use two bytes to encode ä but that the encoding is different.</p>
<div class="terminal">
<pre>
Listing 2. später using a single code-point for ä

        Name:   s       p        ä       t       e       r
              -----   -----   -----   -----   -----   -----
      UTF-16: 00 73   00 70   00 E4   00 74   00 65   00 72
       UTF-8:    73      70   C3 A4      74      65      72
  ISO-8859-1:    73      70      E4      74      65      72
Windows-1252:    73      70      E4      74      65      72
Mac OS Roman:    73      70      8A      74      65      72
</pre>
</div>
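<p>A listing like this one can be reproduced with a few lines of Python. This is a sketch, assuming Python&#8217;s codec names for the five encodings and big-endian UTF-16 without a byte order mark (which is how the listings here are written); hex_dump is a made-up helper.</p>
<div class="terminal">
<pre>
```python
WORD = "sp\u00E4ter"   # später, with the single code point U+00E4 for ä
ENCODINGS = ("utf-16-be", "utf-8", "iso-8859-1", "cp1252", "mac_roman")


def hex_dump(text, encoding):
    """Encode text and return its bytes as upper-case hex pairs."""
    return text.encode(encoding).hex(" ").upper()


for encoding in ENCODINGS:
    print(f"{encoding:>12}: {hex_dump(WORD, encoding)}")
```
</pre>
</div>
<p>Only the byte for ä differs between the three 8-bit tables; the ASCII letters come out the same everywhere.</p>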
<p>In addition to character code-points with &#8220;accents&#8221;, Unicode also defines diacritical marks. Some diacritics are combining and some are not.</p>
<p>Diacritic what?</p>
<p>Diacritics are &#8220;accents&#8221; which are defined separately from the actual character. When a character is followed by a combining diacritical mark, the two are combined to form a single character. In the case of ä, the character &#8216;a&#8217; plus the diacritical combining mark &#8216;¨&#8217; are combined by the rendering engine (e.g., the browser) into a single character &#8216;ä&#8217;. (See also, glyph).</p>
<p>Not all diacritics are &#8220;accents&#8221; nor are they combining. The apostrophe is considered a non-combining diacritical mark and is not an &#8220;accent&#8221;. For example, the apostrophe in &#8220;I have Carol&#8217;s book&#8221; is used to denote possession rather than pronunciation.</p>
<p>The 8-bit code tables do not have a code-point for the combining &#8216;¨&#8217; and cannot represent it. MySQL&#8217;s default character encoding when creating tables is ISO-8859-1. You can change the defaults but out of the box, it&#8217;s ISO-8859-1. Most XML I deal with is UTF-8 encoded.</p>
<p>If you have a really clever conversion routine, you <em>can</em> convert später from UTF-8 into ISO-8859-1. I don&#8217;t recommend it. Eventually, you will run into characters that simply cannot be mapped into the 8-bit code space of ISO-8859-1. How prevalent is this problem? Consider the musicians Sinéad O&#8217;Connor, Björk, and <a href="http://www.keikomatsui.com/v1/kazu.html">松居和</a>.</p>
<p>Notice, again, the difference between UTF-8 and UTF-16 character encodings.</p>
<div class="terminal">
<pre>
Listing 3. später using a combining diacritical mark, a + ¨ --> ä

        Name:   s       p           ä           t       e       r
              -----   -----   -------------   -----   -----   -----
      UTF-16: 00 73   00 70   00 61   03 08   00 74   00 65   00 72
       UTF-8:    73      70      61   CC 88      74      65      72
  ISO-8859-1:    73      70      61     ?        74      65      72
Windows-1252:    73      70      61     ?        74      65      72
Mac OS Roman:    73      70      61     ?        74      65      72
</pre>
</div>
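<p>The question marks in Listing 3 are what a naive conversion runs into. Below is a sketch of one possible conversion routine, assuming NFC composition before encoding; to_latin1 is a made-up name, and it still fails for characters with no ISO-8859-1 equivalent.</p>
<div class="terminal">
<pre>
```python
import unicodedata

decomposed = "spa\u0308ter"   # a followed by COMBINING DIAERESIS, as in Listing 3


def to_latin1(text):
    """Compose with NFC, then encode as ISO-8859-1.

    Raises UnicodeEncodeError when a character has no ISO-8859-1 code point.
    """
    return unicodedata.normalize("NFC", text).encode("iso-8859-1")


try:
    decomposed.encode("iso-8859-1")
except UnicodeEncodeError as err:
    print("direct encode fails:", err.reason)

print("after NFC:", to_latin1(decomposed).hex(" ").upper())
```
</pre>
</div>
<p>Composing first rescues später, but nothing rescues 松居和: those characters are simply not in the table.</p>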
<p>There are other 8-bit code tables for many of the phonetic alphabets. The English word &#8216;later&#8217; translated into Russian is &#8216;позже&#8217;. Each 8-bit code table is tied to a specific alphabet.</p>
<p>List of <a href="http://en.wikipedia.org/wiki/ISO-8859">ISO-8859</a> code tables.</p>
<div class="terminal">
<pre>
ISO-8859-1  Latin-1, Western European
ISO-8859-2  Latin-2, Central European
ISO-8859-3  Latin-3, South European
ISO-8859-4  Latin-4, North European
ISO-8859-5  Latin/Cyrillic
ISO-8859-6  Latin/Arabic
ISO-8859-7  Latin/Greek
ISO-8859-8  Latin/Hebrew
ISO-8859-9  Latin-5, Turkish
ISO-8859-10 Latin-6, Nordic
ISO-8859-11 Latin/Thai
ISO-8859-12 -- abandoned --
ISO-8859-13 Latin-7, Baltic Rim
ISO-8859-14 Latin-8, Celtic
ISO-8859-15 Latin-9
ISO-8859-16 Latin-10, South-Eastern European
</pre>
</div>
<p>In each of these ISO-8859 tables, the first 128 characters are identical with 7-bit ASCII.</p>
<p>There are separate code tables for ISO-8859, Windows and Mac since &#8216;позже&#8217; cannot be represented in their Latin-1 equivalents.</p>
<div class="terminal">
<pre>
Listing 4. позже

        Name:   п       о       з       ж       е
              -----   -----   -----   -----   -----
      UTF-16: 04 3F   04 3E   04 37   04 36   04 35
       UTF-8: D0 BF   D0 BE   D0 B7   D0 B6   D0 B5
  ISO-8859-5:    DF      DE      D7      D6      D5
Windows-1251:    EF      EE      E7      E6      E5
Mac-Cyrillic:    EF      EE      E7      E6      E5
</pre>
</div>
<p>How, then, can you encode a document using one of the ISO-8859 tables that contains the three words &#8216;later&#8217;, &#8216;später&#8217; and &#8216;позже&#8217;?</p>
<p>You can&#8217;t. Not with the 8-bit tables. But <a href="http://www.unicode.org/">Unicode</a> is different. It is a unifying code table with more than <a href="http://en.wikipedia.org/wiki/Unicode">100,000 characters</a>.</p>
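<p>This is easy to try out. A sketch that attempts to encode all three words as one document; can_encode is a made-up helper, and the codec names are Python&#8217;s.</p>
<div class="terminal">
<pre>
```python
# later, später and позже in one string
DOCUMENT = "later sp\u00E4ter \u043F\u043E\u0437\u0436\u0435"


def can_encode(text, encoding):
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("iso-8859-1", "iso-8859-5", "utf-8", "utf-16-be"):
    print(f"{encoding:>10}: {can_encode(DOCUMENT, encoding)}")
```
</pre>
</div>
<p>ISO-8859-1 fails on the Cyrillic letters and ISO-8859-5 fails on ä, while both Unicode encodings take the whole string.</p>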
<p>The 8-bit tables combine both the code-point and the character encoding. That is, the code point for the letter &#8216;a&#8217; in ISO-8859-1 is 0x61 and the character encoding for &#8216;a&#8217;&#8211;the bit pattern used to represent &#8216;a&#8217; in the computer&#8211;is also 0x61.</p>
<p>Unicode is not a character encoding. It is strictly a character code table. Therefore, Unicode needs an encoding scheme to represent the character as a bit pattern. Since the code table is not tied to any character encoding, any suitable encoding scheme may be used. Many have emerged, most notably UTF-8 and UTF-16.</p>
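<p>The split between code point and encoding can be made visible in a few lines. A sketch; describe is a made-up helper, and UTF-16 is shown big-endian without a byte order mark.</p>
<div class="terminal">
<pre>
```python
def describe(ch):
    """Return a character's code point and its bytes in three encodings."""
    return {
        "code point": f"{ord(ch):04X}",
        "ISO-8859-1": ch.encode("iso-8859-1").hex(" ").upper(),
        "UTF-8": ch.encode("utf-8").hex(" ").upper(),
        "UTF-16": ch.encode("utf-16-be").hex(" ").upper(),
    }


for ch in ("a", "\u00E4"):   # 'a' and 'ä'
    print(ch, describe(ch))
```
</pre>
</div>
<p>For &#8216;a&#8217; the code point and every single-byte pattern agree. For &#8216;ä&#8217; the ISO-8859-1 byte still equals the code point, but the UTF-8 bytes do not.</p>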
<p>Eventually, you run into a scenario where none of the ISO-8859 tables will work for you. Consider Devanagari, the script used in Hindi. The ISO-8859 table for Devanagari was abandoned and India uses its own table, ISCII. (Like ASCII but with an &#8216;I&#8217;). The Hindi word for later is &#8216;बाद में&#8217;. Devanagari makes liberal use of diacritical marks.</p>
<div class="terminal">
<pre>
Listing 5. बाद में
Name:      ब         ा          द                          में
        --------  --------  --------  -----  ----------------------------
UTF-16:    09 2C     09 3E     09 26  00 20     09 2E     09 47     09 02
 UTF-8: E0 A4 AC  E0 A4 BE  E0 A4 A6     20  E0 A4 AE  E0 A5 87  E0 A4 82
 ISCII:       DA        EA        D4     20        DC        F1        B2
</pre>
</div>
<p>By now, you get the point. If you must deal with various character sets, Unicode is your friend. That&#8217;s great. Wow! All those phonetic alphabets in one table. &#8220;I bet this works great for Chinese characters&#8221;, you say? It does but it has yet to catch on in the Chinese-speaking world.</p>
<p>Chinese characters come in two flavors: <a href="http://en.wikipedia.org/wiki/Traditional_Chinese_characters">Traditional</a> and <a href="http://en.wikipedia.org/wiki/Simplified_Chinese_characters">Simplified</a>. The English word &#8216;later&#8217; is 稍後 in traditional characters and 稍后 in simplified characters. It is the same word, the same pronunciation but a different rendering.</p>
<p>Chinese speaking countries have adopted either Traditional or Simplified characters as their national standard. Many characters are visually the same in both standards (e.g., 稍) while others are not (e.g., 後 and 后). And there are numerous character encoding schemes that exist solely for the Chinese character sets. Two standards emerge as the leaders.</p>
<p>Traditional character countries gravitate towards <a href="http://en.wikipedia.org/wiki/Big5">Big5</a> while Simplified character countries lean towards <a href="http://en.wikipedia.org/wiki/Gb2312">GB2312</a>. These are double-byte code tables: each Chinese character takes 16 bits.</p>
<p>Notice, however, that both Big5 and GB2312 have code points for and character encodings for both traditional and simplified characters.</p>
<div class="terminal">
<pre>
Listing 6. 稍後

  Name:    稍          後
        --------   --------
UTF-16:    7A 0D      5F 8C
 UTF-8: E7 A8 8D   E5 BE 8C
  Big5:    B5 78      AB E1
GB2312:    C9 D4      E1 E1
</pre>
</div>
<div class="terminal">
<pre>
Listing 7. 稍后

  Name:    稍          后
        --------   --------
UTF-16:    7A 0D      54 0E
 UTF-8: E7 A8 8D   E5 90 8E
  Big5:    B5 78      A6 59
GB2312:    C9 D4      BA F2
</pre>
</div>
<p>You will find that UTF-8 character encoding for Chinese characters generally requires three bytes (or more) whereas Big5 and GB2312 require only two. This becomes a serious issue for web application <a href="http://en.wikipedia.org/wiki/Internationalization">internationalization</a> (i18n). If you create MySQL tables to use UTF-8, you will need to translate any differing character encoding into UTF-8. This isn&#8217;t so much a problem with XML (as it is usually UTF-8, in my experience) or with your own website (as you have control over the front end).</p>
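<p>The size difference is easy to measure. A sketch using Python&#8217;s big5 and gb2312 codecs, which are an assumption on my part and may not match every tool&#8217;s tables byte for byte; byte_count is a made-up helper.</p>
<div class="terminal">
<pre>
```python
TRADITIONAL = "\u7A0D\u5F8C"   # 稍後
SIMPLIFIED = "\u7A0D\u540E"    # 稍后


def byte_count(text, encoding):
    """Return how many bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


print("UTF-8, traditional: ", byte_count(TRADITIONAL, "utf-8"))
print("UTF-8, simplified:  ", byte_count(SIMPLIFIED, "utf-8"))
print("Big5, traditional:  ", byte_count(TRADITIONAL, "big5"))
print("GB2312, simplified: ", byte_count(SIMPLIFIED, "gb2312"))
```
</pre>
</div>
<p>Two characters take six bytes in UTF-8 and four in either of the double-byte tables, so plan on half again as much storage for text that is mostly Chinese.</p>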
<p>Lastly, Unicode is not a 16-bit table. It&#8217;s bit independent. When more space is needed, the Unicode folks will add more table space. For example, the Unicode code points for Tai Xuan Jing Symbols (under Miscellaneous Symbols and Dingbats) define a code-point for the symbol for eternity at U+1D33A. The system fonts for Macintosh include a rendering for this symbol. I don&#8217;t know if they exist on other platforms so I include a picture of it.</p>
<p><img width="43" height="48" alt="TETRAGRAM FOR ETERNITY" src="http://www.redleopard.com/images/1D33A-eternity.png" /></p>
<div class="terminal">
<pre>
Listing 8. (TETRAGRAM FOR ETERNITY)

UTF-16: D8 34 DF 3A
 UTF-8: F0 9D 8C BA
</pre>
</div>
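<p>The four UTF-16 bytes are a surrogate pair, which is how UTF-16 reaches code points above U+FFFF. A sketch of the arithmetic, checked against Python&#8217;s own encoder; surrogate_pair is a made-up name.</p>
<div class="terminal">
<pre>
```python
CODE_POINT = 0x1D33A   # TETRAGRAM FOR ETERNITY


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + offset // 0x400   # top 10 bits of the 20-bit offset
    low = 0xDC00 + offset % 0x400     # bottom 10 bits of the 20-bit offset
    return high, low


high, low = surrogate_pair(CODE_POINT)
print(f"surrogates: {high:04X} {low:04X}")
print("UTF-16:", chr(CODE_POINT).encode("utf-16-be").hex(" ").upper())
print("UTF-8: ", chr(CODE_POINT).encode("utf-8").hex(" ").upper())
```
</pre>
</div>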
<p>And there you have it. Character codes and encoding. But this isn&#8217;t the end of the story. To complete the round trip from web page to database to web page, you will need to contend with URL encoding and configuring your webapp. But that will need to wait for another day.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.redleopard.com/2008/12/character-codes-and-encoding/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

