Unicode Backlash

Yesterday, I stumbled upon a blog that made me laugh. Truth be told, I have a snarky side: my evil twin, if you will. I keep it in check. Mostly. Some of you will understand. Some of you won’t. It’s a Gemini thing.

My twin is fun but good luck putting it back in its cage. And believe me, for every dollar of fun you get, you’ll pay ten dollars in social disaster. Best to keep the twin in check.

But yesterday, I stumbled upon Ted Dziuba’s blog and, holy smokes, my twin was rattling its cage! Ted’s rant on Unicode is spot on. I’ll go further: Unicode is super-important only to people for whom Unicode is super-important. If your backend services only understand ASCII, then Unicode is anti-important.

For example, YouTube is happy with 无法停下, as are most of the online music services. But that costs money, believe it. If you can make do with ASCII, you’ll save money and a lot of headaches.

Oh, how did I stumble upon Ted’s place? Saw his scalability article in the Hot Links list on highscalability.

My twin rip snorted.

Speech Reference Materials

Below are several resources I used in writing a speech delivered this morning for Early Risers Toastmasters entitled “Feedback Loops in Personal Practices.” For those who were unable to attend, the talk focused on somatic learning and the importance of personal practices. (I am interested in personal practices as access to metaprogramming.) Feedback loops are important to mitigate the risk of adopting destructive practices or of improperly performing a practice. I concluded with a tie-in reference to Theo’s new project of video recording the club’s speeches (if the speaker requests it).

When I promised this morning that I would publish the list, I didn’t think it would be such an ordeal. But when I stared at the bald list of book titles and author names, I knew it was lacking. The soup just needed a bit more seasoning.

Of course, any speech draws upon a lifetime of experiences. These are the books pulled from the bookcase and stacked upon my desk from which I double-checked material.

Adele Westbrook and Oscar Ratti
"Aikido and the Dynamic Sphere"
ISBN: 978-0804832847

GOOGLE: "Aikido and the Dynamic Sphere"
SOURCE: http://books.google.com/books?id=AWAMSZfc97EC
AMAZON: http://www.amazon.com/dp/0804832846

Adele Westbrook studied philosophy at Columbia University and has since made a career in advertising and publishing. She is currently an executive for a New York City publishing company.

Oscar Ratti received a degree in classical studies and law from the University of Naples, where he was an intercollegiate Greco-Roman wrestling champion, as well as a member of the championship judo team. Mr. Ratti is a commercial illustrator, and he serves as a design consultant for traditional and web-based publications.

Ms. Westbrook and Mr. Ratti have together taught aikido in New York, working with youth groups at centers affiliated with the YMCA.

KELLY: “I’ve read a number of Aikido books and found this to be the most enjoyable and the most pertinent to somatic learning. It wasn’t written for that purpose so you’ll have to work at ‘seeing the broader picture’. If you simply want a contextual basis for reading parts of Strozzi-Heckler’s ‘The Leadership Dojo’, I recommend speed reading the material. However, I invite you to participate in Aikido classes over an extended period, at least a year. My rationale is consistent with the following excerpt from the Institute of Transpersonal Psychology:”

ITP: “It may seem paradoxical to include martial arts practice as an important aspect to being a therapist. When we think of the martial arts, words such as, ‘opponent’, ‘defeat’, and ‘against’ often come to mind. However, Aikido differs from disciplines such as karate, tai chi, and even yoga because it emphasizes the importance of blending with your partner. In Aikido, as in therapy, it is necessary to read body language and understand the intention of the person with whom you are working. These are some of the fundamental reasons that ITP requires the study of Aikido for our Residential students.”

Richard Strozzi-Heckler
"The Leadership Dojo"
ISBN: 978-1583942017

GOOGLE: "Richard Strozzi-Heckler" "The Leadership Dojo"
SOURCE: http://books.google.com/books?id=87a_giMS88UC
AMAZON: http://www.amazon.com/dp/1583942017

Richard Strozzi Heckler, PhD is the founder and President of Strozzi Institute. A nationally known speaker and consultant on leadership and mastery, he has spent more than three decades researching, developing, and teaching the practical application of Somatics (the unity of language, action, and meaning) to business leaders and executive managers.

EXCERPT: “300 repetitions produce body memory, which is the ability to enact the correct movement, technique, or conversation by memory. It’s also been pointed out that 3,000 repetitions creates embodiment, which is not having to think about doing the activity–it’s simply part of who we are.”

KELLY: “This was the book that pulled a lot of the other material together for me. It is not an academic study and I wouldn’t use it alone as an authoritative source. It does present a coherent description of somatic learning as practiced by the author in his training business. I suggest reading this material after having studied the other references mentioned in this list.”

Tracy Goss
"The Last Word on Power"
ISBN: 978-0385474924

GOOGLE: "Tracy Goss" "The Last Word on Power"
AMAZON: http://www.amazon.com/dp/038547492X

Tracy Goss is President of Goss-Reid Associates, a management consulting firm based in Austin, Texas. She specializes in working with CEOs and their senior management teams, worldwide, to invent and strategically plan an “impossible future” and to “re-invent” themselves and their executive cadre to successfully lead their organization into that future.

EXCERPT: “Language is the only leverage for changing the context of the world around you. This is because people apprehend and construct reality through the way they speak and listen. Or, as Martin Heidegger put it, ‘Language is the house of being.’”

KELLY: “This is the best publicly available source of Cylon (i.e., what drives us is a mechanical process controlled by our structures of interpretation) doctrine. It may not be as accessible to readers who have not participated in a Cylon-esque education program. The material becomes clearer through experience. If you only READ the material, that’s okay. You will benefit from even just a conceptual understanding before reading the other reference material. I include ‘The Last Word on Power’ because it shares many aspects with somatic learning while remaining, in many ways, incompatible with somatic learning. Puzzling out exactly where is educational.”

Malcolm Gladwell  [wikipedia.org]
"Blink"
ISBN: 978-0316010665

GOOGLE: "Malcolm Gladwell" Blink
AMAZON: http://www.amazon.com/dp/0316010669

Malcolm Gladwell is a British-born Canadian journalist, author, and pop sociologist, based in New York City. He has been a staff writer for The New Yorker since 1996. He is best known as the author of the books The Tipping Point (2000), Blink (2005), and Outliers (2008).

KELLY: “I like Gladwell’s use of Paul Ekman’s work on facial expressions. While Gladwell isn’t an academic researcher, his treatment of facial expressions is both entertaining and easy to remember. I also like Gladwell’s treatment of John Gottman’s work on ‘thin slicing’. I integrated Ekman and Gottman’s work and juxtaposed that against Strozzi-Heckler’s material on somatic learning and what arose was an interesting postulate: ‘The mind makes its thoughts real for the body and the body/experience programs the mind.’”

James Robbins
"Build a Better Buddha"
ISBN: 978-0892540655

GOOGLE: "James Robbins" "Build a Better Buddha"
AMAZON: http://www.amazon.com/dp/0892540656

James Robbins holds two graduate degrees, a Master’s degree in English literature from the University of Texas at Austin, and a master’s degree in professional counseling from Amberton University in Dallas. His first book of non-fiction, Build A Better Buddha, was published in 2003 by Nicolas-Hays, Inc. In 2004, Tony Robbins, world-renowned peak performance coach, personally selected this book as motivational reading for his elite, international group of Platinum Partnership clients. Better Buddha examines a cross-section of East and West, integrating aspects of Western psychology with Eastern philosophy.

KELLY: “I don’t even know where to begin. Most of what’s being said today has been said millennia ago. A good primer.”

K. Anders Ericsson  [wikipedia.org]
"The Making of an Expert"
Harvard Business Review, July-August 2007

GOOGLE: Anders Ericsson "The Making of an Expert" filetype:pdf
SOURCE: coachingmanagement.nl

Dr. K. Anders Ericsson is Conradi Eminent Scholar and Professor of Psychology at Florida State University who is widely recognized as one of the world’s leading theoretical and experimental researchers on expertise.

EXCERPT: “By now it will be clear that it takes time to become an expert. Our research shows that even the most gifted performers need a minimum of ten years (or 10,000 hours) of intense training before they win international competitions. In some fields the apprenticeship is longer: It now takes most elite musicians 15 to 25 years of steady practice, on average, before they succeed at the international level.”

KELLY: “The Ericsson ‘10,000 hours rule’ is often cited and often used out of context. The paper is easily accessible (i.e., not pedantic) and I believe it essential to judging whether another author’s reference of Ericsson’s work is legitimate.”

Paul Ekman  [wikipedia.org]

Paul Ekman is a psychologist who has been a pioneer in the study of emotions and their relation to facial expressions. He is considered one of the 100 most eminent psychologists of the twentieth century. The background of Ekman’s research analyzes the development of human traits and states over time. He retired in 2004 as professor of psychology in the Department of Psychiatry at the University of California, San Francisco (UCSF).

John Gottman  [wikipedia.org]

John Gottman, Ph.D. is known for his work on marital stability and relationship analysis through scientific direct observations published in peer-reviewed literature. Dr. Gottman found his methodology predicts with 90% accuracy which newlywed couples will remain married and which will divorce four to six years later. It is also 81% accurate in predicting which marriages will survive after seven to nine years. Dr. Gottman is a Professor Emeritus of psychology at the University of Washington, and with his wife Dr. Julie Gottman now heads a non-profit research institute.

Albert Mehrabian  [wikipedia.org]

Albert Mehrabian (born 1939, currently Professor Emeritus of Psychology, UCLA), has become known best by his publications on the relative importance of verbal and nonverbal messages. His findings on inconsistent messages of feelings and attitudes have been quoted throughout human communication seminars worldwide, and have also become known as the 7%-38%-55% Rule.

KELLY: “I recommend at least reading through the summary material on wikipedia as Mehrabian’s results are often misconstrued. As noted, ‘It is emphatically not the case that non-verbal elements in all senses convey the bulk of the message, though this is how his conclusions are frequently quoted.’ Nonetheless, Mehrabian’s work adds to understanding somatic learning, in my opinion, in that voice and body are an integral element in projecting one’s message successfully or unsuccessfully; it’s not enough to focus strictly on language acts.”

Blue Light

David Roback and Hope Sandoval

Rolling Stone reports that Mazzy Star’s “Sandoval confirms her [sic] and her bandmate David Roback haven’t called it quits and they are still working on their anticipated fourth album. But she declines to give many specifics. ‘It’s true we’re still together,’ she says. ‘We’re almost finished [with the record]. But I have no idea what that means.’”

An album. Really? A NEW album.

Just when you knew you couldn’t get any luckier. I thought the group was done. This is huge. Entire swathes of a person’s life are painted in poetry. If you’re lucky, you’ll find your poet. Nearly a decade of my life, starting in the early nineties, basked in the smoky blue light of Hope Sandoval’s lyrics and David Roback’s music. Pure poetry.

BLUE LIGHT
So Tonight That I Might See (1993)

  There's a blue light in my best friend's room
  There's a blue light in his eyes
  There's a blue light, yeah
  I want to see it, shine

  There's a ship that sails by my window
  There's a ship that sails on by
  There's a world under it
  I think I see it, sailing away

  I think it's sailing
  Miles crashing me by
  Crashing me by
  Crashing me by

  There's a world outside my doorstep
  Flames over everyone's heart
  Don't you see them shining
  I want to hear them, beating for me

  I think I hear them
  Waves crashing me by
  Crashing me by
  Crashing me by

Mandarin Wednesday I


I finished Mandarin Tuesday III this past Spring. There’s a lot going on at work and I must admit, I didn’t put in the same level of effort as I showed in Mandarin I and II. I believe anyone who is learning a foreign language will concur, class is a bitch when you’ve not put in the requisite study time. Nevertheless, perseverance pays and I crawled my way to the end. I have “completed” the entirety of “Practical Chinese Reader Book 1” but I still babble like an idiot when confronted with native Chinese speakers. So, what to do? Sign up for the next course! Pain? Haha! I laugh. I have known pain in my time. The mild embarrassment and frustration of language school is nothing. NOTHING! Bring it.

Every language starts you out with basic phrases, useful phrases like, “la tiza está en la caja.” Very useful, for instance, if you are in a bank in La Paz. No one speaks English, so you emphatically repeat “THE CHALK IS IN THE BOX” until an interpreter arrives. Then you can cash your traveller’s check.

At some point, it pays to move beyond these essential basic phrases and develop real communication skills.

Intermediate Chinese Conversation (registration opens Aug 17, 2009) moves from basic phrases to actual communication.

“This course is designed for students who can talk about daily life in Mandarin, know Chinese phonetic spelling (pinyin) well, and can read 200 or more Chinese characters. We will work on conversational skills in speaking Chinese. The course will focus on communication skills for travel, business, and everyday use.”

And, the class meets on Wednesday. The last three classes were on Tuesday. Wednesday is much better for me. Sometimes you get lucky.

Mandarin is fun. It’s hard but fun. I have the rest of my life to learn it.

And I shall.

一步一个脚印

bash date tricks

A quick usage note regarding the date util under bash. I sometimes want to convert between a unix timestamp and a formatted date string. I do it infrequently enough that I forget the syntax. This article is me writing down my notes.

In the following example, I want to get timestamps and date strings for both today and yesterday. Why yesterday’s date? Because I want to get yesterday’s data from Google Analytics’ data API. I’ve seen numerous examples that get the day, month and year, then subtract one from the day and propagate the underflow through the month and year. Blech! If I have today’s timestamp, I simply subtract a day’s worth of seconds from it and voilà, yesterday!

#!/bin/bash

# Generate a current unix timestamp
#
day=$( date +%s )

# Adjust the timestamp above by 24 hours
#
seconds_in_a_day=$(( 24 * 60 * 60 ))
yesterday=$(( day - seconds_in_a_day ))
echo "timestamps"

echo "      day : ${day}"
echo "yesterday : ${yesterday}"
echo " "

# create a formatted date string (linux)
#
#echo "linux formatted string"
#echo "      day : $( date -d @${day} '+%Y%m%d' )"
#echo "yesterday : $( date -d @${yesterday} '+%Y-%m-%dT%H:%M:%S%Z' )"

# create a formatted date string (bsd/mac)
#
echo "bsd/mac formatted string"
echo "      day : $( date -r ${day} '+%Y%m%d' )"
echo "yesterday : $( date -r ${yesterday} '+%Y-%m-%dT%H:%M:%S%Z' )"

# create a formatted date string (win)
#
# echo "windows formatted string"
# echo "windows? really?"

echo " "
echo " "
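
The script above makes you comment out one platform or the other. Here is a rough sketch of letting the script pick the right flag at run time. The helper name epoch_to_date is my own invention, and I'm assuming that probing with "date -d @0" is a good enough test for a date that accepts the "-d @SECONDS" form; anything else falls through to the bsd/mac "-r SECONDS" form.

```shell
#!/bin/sh

# epoch_to_date TIMESTAMP FORMAT
# (hypothetical helper) print TIMESTAMP using a strftime-style FORMAT,
# with whichever flavor of date(1) is installed.
epoch_to_date() {
    ts=$1
    fmt=$2
    if date -d "@0" '+%s' >/dev/null 2>&1; then
        # linux: date -d @SECONDS
        date -d "@${ts}" "+${fmt}"
    else
        # bsd/mac: date -r SECONDS
        date -r "${ts}" "+${fmt}"
    fi
}

day=$( date +%s )
yesterday=$(( day - 24 * 60 * 60 ))

echo "      day : $( epoch_to_date "${day}" '%Y%m%d' )"
echo "yesterday : $( epoch_to_date "${yesterday}" '%Y-%m-%dT%H:%M:%S%Z' )"
```

One caveat with the fixed 24-hour step: it ignores daylight-saving changeovers, so close to midnight on those days it can land on the wrong calendar day.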

Another example? Okay, let’s say I had a text file, foo, and I wanted to embed a date and get the checksum. Furthermore, I wanted the filename to include the timestamp corresponding to the embedded date. (bsd/mac version)

#!/bin/bash

# scriptname: md5tagger
# 
# generate a timestamp,
# generate output filename 
# copy formatted date string to output file 
# cat original file to output file
# copy the md5 sum to another output file
#
day=$( date +%s )
fname="$1-${day}"

# quote the names so files with spaces in them still work
date -r "${day}" >"${fname}"
cat "$1" >>"${fname}"
md5 "${fname}" >"${fname}.md5"

Let’s try it! (bsd/mac version)

$ printf "text to copy which\ncould be important\n" >foo
$ ./md5tagger foo

$ ll foo*
-rw-r--r--  1 kelly  kelly  38 Aug 16 13:50 foo
-rw-r--r--  1 kelly  kelly  67 Aug 16 13:51 foo-1250455870
-rw-r--r--  1 kelly  kelly  56 Aug 16 13:51 foo-1250455870.md5

$ cat foo
text to copy which
could be important

$ cat foo-1250455870
Sun Aug 16 13:51:10 PDT 2009
text to copy which
could be important

$ md5 foo-1250455870; cat foo-1250455870.md5 
MD5 (foo-1250455870) = 5cf4d9f274f05b63dfde5f15659cdeb8
MD5 (foo-1250455870) = 5cf4d9f274f05b63dfde5f15659cdeb8

With linux, you substitute md5sum for md5 (and date -d @${day} for date -r ${day}, as in the first script). Of course.
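
If you would rather not edit the script per platform, a small wrapper can choose the checksum tool for you. This is only a sketch: md5_of is a name I made up, and it assumes md5sum prints "HASH  FILENAME" while the bsd/mac md5 accepts -q to print the bare hash.

```shell
#!/bin/sh

# md5_of FILE
# (hypothetical helper) print just the md5 hash of FILE,
# using md5sum when it is installed and md5 otherwise.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        # md5sum prints "HASH  FILENAME"; keep the first field
        md5sum "$1" | cut -d ' ' -f 1
    else
        # bsd/mac md5: -q prints only the hash
        md5 -q "$1"
    fi
}

# usage inside md5tagger would look like:
#   md5_of "${fname}" >"${fname}.md5"
```

Note the output is the bare hash rather than the "MD5 (file) = hash" line shown above, which makes the .md5 file identical on both platforms.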

Verdana Hates Pinyin

I stumbled across an article on lostlaowai.com

www.lostlaowai.com/survival-chinese

which led me to poke around the site a bit. At the above URL, I noticed that some of the combining diacritical marks (tone marks) used in writing pinyin were not rendering properly. I had not seen this problem before. It didn’t make sense.

Things that don’t make sense bug me. And being something of a character geek, I couldn’t let it go. So I tried to reproduce the problem in a test example. I couldn’t. That’s when I discovered a quirky Mac OS X copy+paste issue. I sensed there was a problem but the truth was elusive. You can’t see that copy+paste changes the string characters unless you look at a binary dump of the file (which I did).

Okay, the Mandarin word for ‘good’ is 好 and in pinyin is written ‘hǎo’. It’s possible to write the pinyin using code points from just the Unicode Latin blocks.

Latin Extended-B (Latin)
latin small letter a with caron
Unicode  01CE
UTF-8    C7 8E

   h    ǎ    o
0068 01CE 006F

It’s also possible to write the pinyin using Combining Diacritical Marks.

Combining Diacritical Marks (Combining Marks)
combining caron
Unicode  030C
UTF-8    CC 8C

   h    a 030C    o
0068 0061 030C 006F

Note that the combining mark comes after the character it decorates. This is in contrast to Mac OS X’s U.S. Extended Keyboard input method, in which a modifier letter precedes the character to decorate. However, the modifier letter is not a combining mark. You cannot use it to build a sequence that a browser renders as hǎo; it will come out as hˇao.

Spacing Modifier Letters (Modifier Letters)
caron
Unicode  02C7
UTF-8    CB 87

   h 02C7    a    o
0068 02C7 0061 006F

NOTE: the caron does not combine with the a; OS X does not
modify the 'a' to have a caron above.
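
The UTF-8 bytes in these little tables follow mechanically from the code points, so they are easy to double-check. Below is a sketch of the arithmetic in plain sh. The function name utf8_hex is my own, and it only handles code points up to FFFF, which is plenty for pinyin.

```shell
#!/bin/sh

# utf8_hex CODEPOINT
# (hypothetical helper) print the UTF-8 bytes, in hex, for one
# code point given in hex. Covers 0000 through FFFF only.
utf8_hex() {
    cp=$(( 0x$1 ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte:  0xxxxxxx
        printf '%02X\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02X %02X\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02X %02X %02X\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_hex 01CE   # latin small letter a with caron, prints C7 8E
utf8_hex 030C   # combining caron, prints CC 8C
utf8_hex 02C7   # caron (modifier letter), prints CB 87
```

All three marks land in the two-byte range, which is why each table lists exactly two bytes.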

The OS X input method uses the modifier letter to look up an equivalent code point in Unicode’s Latin blocks.

Using OS X's US Extended Keyboard Input Method
opt-v + a

   h 02C7    a    o              h    ǎ    o
0068 02C7 0061 006F    ==>    0068 01CE 006F

Note: the caron combines with the a; OS X automatically
converts 02C7 + 0061 into 01CE.

To check the code points, I used this handy tool:

people.w3.org/rishida/scripts/uniview/conversion.php

  1. open the OS X character palette
  2. Go to the URL above
  3. place the cursor in the upper left box labeled Characters
  4. type the letter h into the box
  5. type the letter a into the box
  6. from the character palette, insert character 030C into the box
  7. type the letter o into the box
  8. click the convert button just above the Characters box, the UTF-16 Code units box will have the sequence (in unicode code points) 0068 0061 030C 006F
  9. select and copy (cmd+c) the contents of the Characters box
  10. immediately paste contents back into the Characters box
  11. click the convert button just above the Characters box, the UTF-16 Code units box now has the sequence 0068 01CE 006F

Aha! The copy and paste operation changed the string’s character code points! Imagine my surprise.

That mystery solved, I next dove into the lostlaowai source code. This was my first encounter with using character entity encoding of the combining diacritical marks. Rather than type the characters directly into the source code, like this

hǎo

lostlaowai encoded the non-ascii characters like this

ha&#780;o

even though the page encoding was declared as UTF-8

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Maybe it’s a joomla thing. lostlaowai uses joomla.

After a quick bout of deleting blocks of source code, I isolated the culprit!

<html>
<head>
  <meta
    http-equiv="content-type"
    content="text/html; charset=utf-8">
  <title>wonderful.html</title>
</head>
<body>

<pre>
   好極了!
1. ha&#780;o ji&#769;le!
2. hǎo jíle!
<span style="font-family: Verdana,
   Arial, Helvetica, sans-serif;">
3. ha&#780;o ji&#769;le!
4. hǎo jíle!
</span><span style="font-family:
   Arial, Helvetica, sans-serif;">
5. ha&#780;o ji&#769;le!
6. hǎo jíle!
</span></pre>
</body>
</html>

Source code: wonderful.html

Adding Verdana to the font family causes the problem. I searched to see if anyone else had seen this problem. Indeed. Wikipedia.org has an entry on a similar bug and fileformat.info lists the five marks supported by Verdana. That’s sad. Verdana only supports 5 of the 112 code points in Unicode’s Combining Diacritical Marks block.

The Verdana typeface, released in 1996, was created for and is owned by Microsoft. If Microsoft hasn’t fixed Verdana after more than a decade, I’ll assume they never will, and prudence suggests avoiding it.

At least avoid using Verdana in writing pinyin using combining diacritical marks. If you must use Verdana, then use codepoints from unicode’s latin block. On the Mac, this is the default when typing these characters in directly using the U.S. Extended keyboard.

Character     ā    á    ǎ    à
Unicode    0101 00E1 01CE 00E0
------------------------------
Character     ē    é    ě    è
Unicode    0113 00E9 011B 00E8
------------------------------
Character     ī    í    ǐ    ì
Unicode    012B 00ED 01D0 00EC
------------------------------
Character     ō    ó    ǒ    ò
Unicode    014D 00F3 01D2 00F2
------------------------------
Character     ū    ú    ǔ    ù
Unicode    016B 00FA 01D4 00F9
------------------------------
Character     ǖ    ǘ    ǚ    ǜ
Unicode    01D6 01D8 01DA 01DC

If you have to convert an existing web page (like the lostlaowai page mentioned above), you could take advantage of the copy+paste quirk in OS X. Simply open the web page, copy the pinyin and paste it into a text editor (e.g., back into the source). The original text is not rendered properly but that’s ok. The character codes are correct. When you paste it into the editor, OS X will convert the char+mark into a single char from the Latin code blocks.
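
If you don't have a Mac handy, the same conversion can be roughed out with sed. This is a sketch, not a general Unicode normalizer: compose_pinyin is a made-up name, it only knows the five plain vowels with the four tone marks, and it skips ü and capital letters. It also assumes the text is already UTF-8 with real combining marks, not character entities.

```shell
#!/bin/sh

# The four combining tone marks, written as UTF-8 bytes in octal.
macron=$( printf '\314\204' )   # 0304
acute=$( printf '\314\201' )    # 0301
caron=$( printf '\314\214' )    # 030C
grave=$( printf '\314\200' )    # 0300

# compose_pinyin (hypothetical helper): read text on stdin and replace
# each vowel+combining mark pair with the single precomposed character.
compose_pinyin() {
    sed \
        -e "s/a${macron}/ā/g" -e "s/a${acute}/á/g" -e "s/a${caron}/ǎ/g" -e "s/a${grave}/à/g" \
        -e "s/e${macron}/ē/g" -e "s/e${acute}/é/g" -e "s/e${caron}/ě/g" -e "s/e${grave}/è/g" \
        -e "s/i${macron}/ī/g" -e "s/i${acute}/í/g" -e "s/i${caron}/ǐ/g" -e "s/i${grave}/ì/g" \
        -e "s/o${macron}/ō/g" -e "s/o${acute}/ó/g" -e "s/o${caron}/ǒ/g" -e "s/o${grave}/ò/g" \
        -e "s/u${macron}/ū/g" -e "s/u${acute}/ú/g" -e "s/u${caron}/ǔ/g" -e "s/u${grave}/ù/g"
}

# h a 030C o, j i 0301 l e  ==>  h 01CE o, j 00ED l e
printf 'ha\314\214o ji\314\201le!\n' | compose_pinyin
```

The output looks the same on screen as the input, which is the whole problem. Pipe both through a hex dump (od -An -tx1) to see the difference.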

Finally, the character ‘a’ in pinyin is sometimes written using the Unicode code point 0251 ‘ɑ’, which is still Latin but lives in the block called “IPA Extensions”. It has a different look from the standard ASCII character ‘a’. There is no set of code points for ‘ɑ’ that replaces the accented characters in the chart above.