Red Leopard

2008-12-12 / kelly / No Comment

WordPress plugin problem

When I moved to WordPress, I purposely picked a theme with paged listings. Of course, I also started with the latest release, v. 2.7. (Actually 2.7 RC2, then an upgrade.)

The paged navigation at the bottom of the content section did not work. I tried to install the wp-pagenavi plugin, but WordPress complained that the anjing theme had already defined wp_pagenavi().

I’m new to WordPress (and don’t know PHP well) but I do know how to comment out code.

vi ~/public_html/wp-content/themes/anjing/functions.php

and comment out the function

/*
function wp_pagenavi($before = '', $after = '') {
    ⋮
}
*/

Like magic. The plugin activates and I have pages!

2008-12-11 / kelly / No Comment

Character codes and encoding

In the beginning, there was ASCII. (There were others, but we begin here with ASCII).

7-bit ASCII in an 8-bit package. Using only the first seven bits of a byte, standard ASCII could not deal with diacritical marks (accents and funny dots in the vulgar vernacular). Therefore, the German word for later, später, would sometimes be transliterated into spaeter. I suspect this started in the days before computers when typewriters had only so many keys. You had to make do.

The computer age began before most of us were born, and with it came a need to encode characters into bits. Since there is more in the world than the simple “unaccented” Latin characters, other standards emerged. However, most surviving systems use the basic 7-bit ASCII characters as their starting point.

Listing 1. später transliterated into spaeter

        Name:   s       p       a       e       t       e       r
              -----   -----   -----   -----   -----   -----   -----
      UTF-16: 00 73   00 70   00 61   00 65   00 74   00 65   00 72
       UTF-8:    73      70      61      65      74      65      72
  ISO-8859-1:    73      70      61      65      74      65      72
Windows-1252:    73      70      61      65      74      65      72
Mac OS Roman:    73      70      61      65      74      65      72

ISO-8859-1, Windows-1252 and Mac OS Roman are all 8-bit code tables. That is, every character defined in these standards is represented by 8 bits. These tables cover the characters used by most of the western European languages.

In contrast, UTF-16 represents each character with two bytes (or four, for characters outside the 16-bit range) and UTF-8 represents each character with one to four bytes.

Basic ASCII uses only the lower 128 code-points in a byte, 0 through 127 (ergo, the reference to 7-bit ASCII). But there are 8 bits in a byte. That leaves the upper half of the code-points open to accommodate other characters.

ISO-8859-1 and Windows-1252 have the same encoding for ä but Mac OS Roman does not. This historical reality has caused problems since the early DOS/MAC days.

UTF-8 and UTF-16 are encoding schemes. Whereas ISO-8859-1, Windows-1252 and Mac OS Roman have character code-points and character encodings that coincide (are the same), the UTF encoding schemes are simply character encoding schemes for Unicode.

Unicode?

Unicode defines characters in a mega table. Most characters in everyday use are defined by a code point that fits in 16 bits. Not surprisingly, the UTF-16 character encoding is very often the same as the Unicode code-point. UTF-8, on the other hand, is not (outside the ASCII range). Notice in Listing 2 that both UTF-8 and UTF-16 use two bytes to encode ä but that the encoding is different.

Listing 2. später using a single code-point for ä

        Name:   s       p       ä       t       e       r
              -----   -----   -----   -----   -----   -----
      UTF-16: 00 73   00 70   00 E4   00 74   00 65   00 72
       UTF-8:    73      70   C3 A4      74      65      72
  ISO-8859-1:    73      70      E4      74      65      72
Windows-1252:    73      70      E4      74      65      72
Mac OS Roman:    73      70      8A      74      65      72
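
The byte values in these tables can be checked with a few lines of Python. This is only a sketch: the helper name `hex_bytes` is mine, and the codec names (`utf-16-be`, `cp1252`, `mac_roman`) are Python's spellings for big-endian UTF-16 without a byte-order mark, Windows-1252 and Mac OS Roman.

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of `text` as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


if __name__ == "__main__":
    # Print the encoding of a single ä under each scheme used in the listings.
    for codec in ("utf-16-be", "utf-8", "iso-8859-1", "cp1252", "mac_roman"):
        print(f"{codec:>10}: {hex_bytes('ä', codec)}")
```

Swap 'ä' for any other string (for example 'spaeter') to reproduce a full row.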

In addition to character code-points with “accents”, Unicode also defines diacritical marks. Some diacritics are combining and some are not.

Diacritic what?

Diacritics are “accents” which are defined separately from the actual character. When a base character is followed by a combining diacritical mark, the two are combined to form a single character. In the case of ä, the character ‘a’ plus the combining diaeresis ‘¨’ (U+0308) are combined by the rendering engine (e.g., the browser) into a single character ‘ä’. (See also, glyph).

Not all diacritics are “accents” nor are they combining. The apostrophe is considered a non-combining diacritical mark and is not an “accent”. For example, the apostrophe in “I have Carol’s book” is used to denote possession rather than pronunciation.

The 8-bit code tables do not have a code-point for the combining ‘¨’ and cannot represent it. MySQL’s default character encoding when creating tables is ISO-8859-1. You can change the defaults but out of the box, it’s ISO-8859-1. Most XML I deal with is UTF-8 encoded.

If you have a really clever conversion routine, you can convert später from UTF-8 into ISO-8859-1. I don’t recommend it. Eventually, you will run into characters that simply cannot be mapped into the 8-bit code space of ISO-8859-1. How prevalent is this problem? Consider the musicians Sinéad O’Connor, Björk, and 松居和.

Notice, again, the difference between UTF-8 and UTF-16 character encodings.

Listing 3. später using a combining diacritical mark, a + ¨ --> ä

        Name:   s       p           ä           t       e       r
              -----   -----   -------------   -----   -----   -----
      UTF-16: 00 73   00 70   00 61   03 08   00 74   00 65   00 72
       UTF-8:    73      70      61   CC 88      74      65      72
  ISO-8859-1:    73      70      61     ?        74      65      72
Windows-1252:    73      70      61     ?        74      65      72
Mac OS Roman:    73      70      61     ?        74      65      72
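
The two spellings of ä (one code point versus ‘a’ plus a combining mark) look identical on screen but do not compare equal as strings. A minimal Python sketch of the difference, using the standard `unicodedata` module; the helper name `same_text` is my own:

```python
import unicodedata

precomposed = "sp\u00e4ter"   # ä as the single code point U+00E4
combining = "spa\u0308ter"    # 'a' followed by COMBINING DIAERESIS U+0308


def same_text(a, b):
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(precomposed == combining)            # raw comparison of code points
    print(same_text(precomposed, combining))   # comparison after normalization
    print(len(precomposed), len(combining))    # number of code points in each
```

Normalizing to one form before storing or comparing text is one common way to avoid surprises here.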

There are other 8-bit code tables for many of the phonetic alphabets. The English word ‘later’ translated into Russian is ‘позже’. Each 8-bit code table is tied to a specific alphabet.

List of ISO-8859 code tables.

ISO-8859-1  Latin-1, Western European
ISO-8859-2  Latin-2, Central European
ISO-8859-3  Latin-3, South European
ISO-8859-4  Latin-4, North European
ISO-8859-5  Latin/Cyrillic
ISO-8859-6  Latin/Arabic
ISO-8859-7  Latin/Greek
ISO-8859-8  Latin/Hebrew
ISO-8859-9  Latin-5, Turkish
ISO-8859-10 Latin-6, Nordic
ISO-8859-11 Latin/Thai
ISO-8859-12 -- abandoned --
ISO-8859-13 Latin-7, Baltic Rim
ISO-8859-14 Latin-8, Celtic
ISO-8859-15 Latin-9
ISO-8859-16 Latin-10, South-Eastern European

In each of these ISO-8859 tables, the first 128 code-points (0x00 through 0x7F) are identical with 7-bit ASCII.

ISO-8859, Windows and Mac each have a separate Cyrillic code table, since ‘позже’ cannot be represented in their Latin-1 equivalents.

Listing 4. позже

        Name:   п       о       з       ж       е
              -----   -----   -----   -----   -----
      UTF-16: 04 3F   04 3E   04 37   04 36   04 35
       UTF-8: D0 BF   D0 BE   D0 B7   D0 B6   D0 B5
  ISO-8859-5:    DF      DE      D7      D6      D5
Windows-1251:    EF      EE      E7      E6      E5
Mac-Cyrillic:    EF      EE      E7      E6      E5

How, then, can you encode a document using one of the ISO-8859 tables that contains the three words ‘later’, ‘später’ and ‘позже’?

You can’t. Not with the 8-bit tables. But Unicode is different. It is a unifying code table with more than 100,000 characters.
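
A quick way to see this for yourself, sketched in Python. The helper `can_encode` is hypothetical; it simply reports whether a codec can represent every character in a string.

```python
words = ["later", "später", "позже"]
document = " ".join(words)


def can_encode(text, encoding):
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Each 8-bit table handles its own alphabet but not the mixed document.
    for codec in ("iso-8859-1", "iso-8859-5", "utf-8", "utf-16"):
        print(f"{codec:>10}: {can_encode(document, codec)}")
```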

The 8-bit tables combine both the code-point and the character encoding. That is, the code point for the letter ‘a’ in ISO-8859-1 is 0x61 and the character encoding for ‘a’–the bit pattern used to represent ‘a’ in the computer–is also 0x61.

Unicode is not a character encoding. It is strictly a character code table. Therefore, Unicode needs an encoding scheme to represent the character as a bit pattern. Since the code table is not tied to any character encoding, any suitable encoding scheme may be used. Many have emerged, most notably UTF-8 and UTF-16.

Eventually, you run into a scenario where none of the ISO-8859 tables will work for you. Consider Devanagari, the script used for Hindi. The ISO-8859 table for Devanagari was abandoned and India uses its own table, ISCII. (Like ASCII but with an ‘I’). The Hindi word for later is ‘बाद में’. Devanagari makes liberal use of diacritical marks.

Listing 5. बाद में

  Name:    ब         ा          द                          में
        --------  --------  --------  -----  ----------------------------
UTF-16:    09 2C     09 3E     09 26  00 20     09 2E     09 47     09 02
 UTF-8: E0 A4 AC  E0 A4 BE  E0 A4 A6     20  E0 A4 AE  E0 A5 87  E0 A4 82
 ISCII:       DA        EA        D4     20        DC        F1        B2

By now, you get the point. If you must deal with various character sets, Unicode is your friend. That’s great. Wow! All those phonetic alphabets in one table. “I bet this works great for Chinese characters”, you say? It does but it has yet to catch on in the Chinese speaking world.

Chinese characters come in two flavors: Traditional and Simplified. The English word ‘later’ is 稍後 in traditional characters and 稍后 in simplified characters. It is the same word, the same pronunciation but a different rendering.

Chinese speaking countries have adopted either Traditional or Simplified characters as their national standard. Many characters are visually the same in both standards (e.g., 稍) while others are not (e.g., 後 and 后). And there are numerous character encoding schemes that exist solely for the Chinese character sets. Two standards emerge as the leaders.

Traditional character countries gravitate towards Big5 while Simplified character countries lean towards GB2312. Both encode each Chinese character in two bytes (16 bits).

Notice, however, that both Big5 and GB2312 have code points for and character encodings for both traditional and simplified characters.

Listing 6. 稍後

  Name:    稍          後
        --------   --------
UTF-16:    7A 0D      5F 8C
 UTF-8: E7 A8 8D   E5 BE 8C
  Big5:    B5 79      AB E1
GB2312:    C9 D4      E1 E1

Listing 7. 稍后

  Name:    稍          后
        --------   --------
UTF-16:    7A 0D      54 0E
 UTF-8: E7 A8 8D   E5 90 8E
  Big5:    B5 79      A6 5A
GB2312:    C9 D4      BA F3

You will find that the UTF-8 character encoding for Chinese characters generally requires three bytes (or more) whereas Big5 and GB2312 require only two. This becomes a serious issue for web application internationalization (i18n). If you create MySQL tables to use UTF-8, you will need to translate any differing character encoding into UTF-8. This isn’t so much a problem with XML (as it is usually UTF-8, in my experience) or with your own website (as you have control over the front end).
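
The size difference is easy to measure. A small Python sketch, where `encoded_size` is a hypothetical helper and `big5` and `gb2312` are Python's codec names for those tables:

```python
def encoded_size(text, encoding):
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


if __name__ == "__main__":
    for text, legacy in (("稍後", "big5"), ("稍后", "gb2312")):
        print(text,
              "utf-8:", encoded_size(text, "utf-8"),
              f"{legacy}:", encoded_size(text, legacy))
```

For text that is mostly Chinese, that works out to roughly half again as many bytes in UTF-8 as in the legacy tables.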

Lastly, Unicode is not a 16-bit table. It’s bit independent. When more space is needed, the Unicode folks will add more table space. For example, the Unicode code points for Tai Xuan Jing Symbols (under Miscellaneous Symbols and Dingbats) define a code-point for the symbol for eternity at U+1D33A. The system fonts for Macintosh include a rendering for this symbol. I don’t know if they exist on other platforms so I include a picture of it.

Listing 8. (TETRAGRAM FOR ETERNITY)

UTF-16: D8 34 DF 3A
 UTF-8: F0 9D 8C BA
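
The four UTF-16 bytes above are a surrogate pair: two 16-bit units that together stand for one code-point beyond U+FFFF. The arithmetic is short enough to sketch in Python (the function name `utf16_surrogates` is my own):

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = utf16_surrogates(0x1D33A)
    print(f"{high:04X} {low:04X}")
```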

And there you have it. Character codes and encoding. But this isn’t the end of the story. To complete the round trip from web page to database to web page, you will need to contend with URL encoding and configuring your webapp. But that will need to wait for another day.

2008-12-10 / kelly / No Comment

EC2 and S3 Success Story

I’ve been building systems lately on Amazon’s Elastic Compute Cloud (EC2). At first, I was only interested in Amazon’s Simple Storage Service (S3) after seeing the SmugMug slide show.

I hadn’t really considered using EC2 since we had more servers in colocation than I really needed. But I had a file storage problem. When you have a thousand files, you stick them in a directory. When you have a million files, you cannot simply stick them in a single directory. You distribute them across multiple directories. What a PITA.

My first thought was to use MogileFS. It handles the directory hashing for you and distributes redundant copies of files across multiple servers. I had extra servers. Sweet. But before I rushed off and started building my shiny new filesystem, I wanted to check out the competitors. That led me to SmugMug. And that led me to S3.

I work at a tiny startup. I had a problem and very few developers to ask for help. Every hour I needed from them had a significant impact on another project. And dammit, all the open projects were on fire. I needed to solve my file system problem and fast.

So up on S3 the files went. XML files. Beaucoup XML files.

It was painless. It was simple. It was cheap. The monthly S3 cost is a fraction of a server’s cost in colocation. Sweet!

Wait! If that’s so yummy, why not move XML processing up to EC2? Our XML processing load was increasing…increasingly increasing. I rewrote our XML processing app, built a custom Amazon Machine Image (CentOS + Apache + Tomcat) and fired it up. Nice!

Building the machine instance was a pain but worth the effort. I learned a lot about CentOS that I didn’t previously know or really understand. However, I wish I had a real system administrator on staff. It would have hurt less.

One of the goals for the EC2-based XML processing was to shift from offline XML processing to a RESTful web service. That is, rather than queue the XML processing in a single process, I needed to finish the XML processing during the HTTP request. On demand processing. Done in seconds (not tens of minutes). And handle multiple concurrent processing requests.

Here is the EC2 <--> S3 connection. For each file received for processing, I write dozens to hundreds of files to S3 plus open scads of HTTP connections to other web servers. Running these in a single thread burned precious time. Even though we “write” to S3, the underlying mechanism is another HTTP request.

Simple. Build a thread pool for the HTTP requests and run multiple threads concurrently. That worked swimmingly but for one issue. It didn’t take long until I started seeing the “Too many open files” in the exception logs.
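
The thread-pool idea fits in a few lines. The production code described here is Java with Apache's HttpClient; what follows is only a Python stand-in, and `put_object`, the key names and the pool size of 8 are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def put_object(key):
    """Hypothetical stand-in for one HTTP PUT to S3; sleeps to mimic latency."""
    time.sleep(0.05)
    return key


def put_all(keys, max_workers=8):
    """Run the uploads concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(put_object, keys))


if __name__ == "__main__":
    keys = [f"out/{n}.xml" for n in range(40)]
    start = time.perf_counter()
    put_all(keys)
    print(f"{len(keys)} uploads in {time.perf_counter() - start:.2f}s")
```

Each in-flight request holds a socket, which counts as an open file. That is why a bigger pool runs into the open-file limit sooner.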

Normally, the limit on open files is quite adequate. But you bolt Apache’s HttpClient to the backend of your webapp and supercharge it with a healthy thread pool and you will overwhelm the default settings. CentOS will not “garbage collect” the spent files from completed HTTP requests fast enough.

The solution: Up the limits on open files. The default is 1024. Simply edit /etc/security/limits.conf and change the soft and hard values for nofile. I’m sure there is a maximum size but these values have been working for me. What’s appropriate for your system is dependent on your system. You will need to pick size values for yourself.

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4
*                soft    nofile          8192
*                hard    nofile          65536
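
To confirm which limits a process actually ended up with, you can ask from inside the process. A minimal Python sketch using the standard `resource` module (Unix only); the function name `nofile_limits` is my own. As far as I know, changes to limits.conf apply to new login sessions, so check from a fresh one.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"nofile soft={soft} hard={hard}")
```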

What was the net result of moving XML processing and storage up to the Amazon Cloud? Retired 60% of the servers in colocation. Built a scalable infrastructure. Reduced overall monthly hosting costs. Fewer moving parts.

Now, if only I had a system administrator…

2008-12-10 / kelly / One Comment

Actionscript GZIP Alternative

I really wish Actionscript 3 had a native decompression utility for opening GZIP compressed files. I really do. After reading (and trying to implement the advice of) numerous postings, I gave up on GZIP.

In my investigation, I walked byte-by-byte through numerous binary dumps. Somewhere along the way I noticed a pattern. Forget about the header and footer bytes; the basic GZIP compressed data is simply not the same as the data ActionScript expects from its base deflate-algorithm compression.

People argue with me on that one. They say the compressed bytes are simply deflate-algorithm compressed and that these bytes can be decompressed by actionscript. I have but one comment. “Good luck with that, Cowboy. Knock yourself out.”

I have working code in production. Who you gonna believe?

I’ll start with the flex side first (flex3, as3). Use a URLLoader and set the dataFormat to BINARY or it will not work. Since the source data is not GZIP, I’ve made up a new file extension ‘xmlz’.

The code in listing 1 fetches the compressed XML, decompresses it and dumps it in a TextArea.

Listing 1. Simple Flex app to load compressed XML

<?xml version="1.0" encoding="utf-8"?>
<mx:Application
    xmlns:mx="http://www.adobe.com/2006/mxml"
    layout="vertical"
    backgroundGradientColors="[#ffffff, #c0c0c0]"
    width="100%"
    height="100%">

  <mx:Script>
  <![CDATA[
    import flash.utils.ByteArray;

    [Bindable]
    private var bytes:String = "";

    private var loader:URLLoader = new URLLoader();

    private function urlLoaderSend():void
    {
      loader.addEventListener(Event.COMPLETE, setResult);
      loader.dataFormat = URLLoaderDataFormat.BINARY;
      loader.load(new URLRequest("http://example.com/data.xmlz"));
    }

    private function setResult(event:Event):void
    {
      var ba:ByteArray = loader.data as ByteArray;
      ba.position = 0;
      ba.uncompress();
      bytes = ba.toString();
    }
  ]]>
  </mx:Script>

  <mx:Button
      label="get playlist"
      click="urlLoaderSend()" />
  <mx:TextArea
      id="resultTA"
      width="600"
      height="600"
      text="{bytes}" />
</mx:Application>

The server-side java is quite simple. I use the StAX XMLStreamWriter to generate XML. The writer’s constructor takes an OutputStream as an argument. Simply wrap the original OutputStream in a DeflaterOutputStream. The trick is to explicitly create the Deflater. If you don’t, the default will produce bytes incompatible with actionscript’s uncompress().

Listing 2. Java DeflaterOutputStream fragment (with StAX writer)

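import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// OutputStream is abstract; a ByteArrayOutputStream stands in for the real target.
OutputStream out = new ByteArrayOutputStream();
    ⋮

// nowrap = false keeps the zlib header and checksum around the deflate data
Deflater d = new Deflater(Deflater.BEST_COMPRESSION, false);
OutputStream zipper = new DeflaterOutputStream(out, d);

XMLOutputFactory factory = XMLOutputFactory.newInstance();

XMLStreamWriter writer;
writer = factory.createXMLStreamWriter(zipper, "utf-8");
    ⋮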
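
For the curious, the same compressed payload can arrive in three different wrappers: zlib, gzip and bare deflate. As far as I know, Java's Deflater with nowrap set to false emits the zlib wrapper, and ActionScript's uncompress() expects zlib-format input. The Python sketch below only illustrates the three wrappers; it is not the code I run in production, and the sample XML and the name `raw_deflate` are made up.

```python
import gzip
import zlib

xml = b"<playlist><track>one</track><track>two</track></playlist>"


def raw_deflate(data):
    """Compress with bare DEFLATE: no header and no checksum."""
    c = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    return c.compress(data) + c.flush()


zlib_bytes = zlib.compress(xml, 9)   # zlib wrapper: short header + DEFLATE + Adler-32
gzip_bytes = gzip.compress(xml)      # gzip wrapper: header starting 1F 8B + DEFLATE + CRC-32
raw_bytes = raw_deflate(xml)         # DEFLATE data only

if __name__ == "__main__":
    print("zlib starts with", zlib_bytes[:2].hex())
    print("gzip starts with", gzip_bytes[:2].hex())
    for name, blob in (("zlib", zlib_bytes), ("gzip", gzip_bytes), ("raw", raw_bytes)):
        try:
            zlib.decompress(blob)    # a zlib-format reader
            print(name, "-> ok")
        except zlib.error as err:
            print(name, "-> rejected:", err)
```

Only the zlib-wrapped bytes pass a zlib-format reader. Stripping the gzip header and footer leaves bare deflate, which such a reader also rejects.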

And that’s it! The compression is size-equivalent with GZIP (within a few bytes). I would rather actionscript supported GZIP. Until it does, I’ll continue using ‘xmlz’.

2008-12-07 / kelly / No Comment

The Jump to WordPress

I finally made the jump from Movable Type to WordPress. The hardest step was porting the old posts. Whereas I created posts using UTF-8 encoded characters, Movable Type did a dumb convert of the characters into Latin-1 when storing them in MySQL. Since the path out of the database and to the page was a reverse process, when viewing pages the characters would reappear as UTF-8.

However, when exporting the data from Movable Type, the round-trip was broken. The characters in MySQL were not really proper Latin-1 characters. None of the recommended processes I found through Google worked for me. I had to deal with this problem at work last year. The memory of that pain makes me flinch.

In the end, I exported the data from Movable Type, ran iconv to convert to UTF-8 and manually edited the resulting file to correct the mistakes.
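
For anyone facing the same mess, here is what the “dumb convert” looks like in its cleanest form, sketched in Python. The `repair` helper is hypothetical and only undoes the textbook case, where UTF-8 bytes were read as Latin-1 exactly once. My data was dirtier than that, hence the manual edits.

```python
original = "später"

# UTF-8 bytes misread as Latin-1: each byte becomes its own character.
mangled = original.encode("utf-8").decode("latin-1")


def repair(text):
    """Undo one round of UTF-8-read-as-Latin-1 mangling."""
    return text.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    print(mangled)
    print(repair(mangled))
```

If the text was never mangled, or was mangled twice, `repair` will raise an error or return more garbage, so test it on a copy first.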

I chose not to import all my images into each and every post. I simply link to them.

I still don’t have my template set up properly. There are a few bits and pieces from the Movable Type templates I’d like to port. But overall, it went smoothly. Five minutes to install and configure WordPress/MySQL. Twenty minutes to find a template I liked. Ten minutes tweaking the CSS to fit my old content into the new template (images were wider than the default content width). Three hours mucking with the Latin-1::UTF-8 problem.

Now to get pagination, rss/atom feeds, … working.

2008-12-04 / kelly / One Comment

bash directory crawler

Currently, popular filesystems (ext3, hfs+) have a practical limit on the number of files and directories you can store in a single directory. Certainly, most of the unix command line tools will not work once you exceed some magic threshold. In my experience, 10,000 files and/or directories is the practical limit.

So what do you do when you have 1,000,000 XML files to process? I had this very problem recently. Fortunately, the problem was simplified as each file belonged to one of 27,000 categories.

I organized my hierarchy into three directory levels with all the xml files in the lowest level. I then use bash to traverse the directories.

master/
  |
  +-- 0/
  |   |
  |   +-- 0/
  |   |   |
  |   |   +-- f494a6f9-fc57-4408-a637-d3b768d0cd99.xml
  |   |   |
  |   |   +-- 5be1a5ed-f159-41d1-bc2e-737b5d2bed8b.xml
  |   |   |
  |   |   +-- a4276d0f-a014-42c2-a5ec-dbf59dfee95a.xml
  |   ⋮
  |   +-- 9999/
  |
  +-- 1/
  |   |
  |   +-- 10000/
  |   ⋮
  |   +-- 19999/
  |
  +-- 2/
      |
      +-- 20000/
      ⋮
      +-- 26999/
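
Reading the tree above, the top-level directory appears to be the category number divided by 10,000. That is my inference from the layout, and the sketch below assumes it. The function name `leaf_directory` is hypothetical.

```python
from pathlib import PurePosixPath


def leaf_directory(master, category):
    """Map a category number to master/<category // 10000>/<category>."""
    if category < 0:
        raise ValueError("category must be non-negative")
    return PurePosixPath(master) / str(category // 10000) / str(category)


if __name__ == "__main__":
    for category in (0, 9999, 10000, 26999):
        print(leaf_directory("master", category))
```

With 27,000 categories this keeps every directory at 10,000 entries or fewer.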

In my problem space, I am guaranteed that each leaf directory has at least one and at most a few hundred xml files. The following script is in production use with the one exception that I’m doing more than simply counting words.

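#!/bin/bash

# Stop if the master directory is missing rather than crawling the wrong place.
cd /home/alice/work/master || exit 1
master_directory=$(pwd)

# Unmatched globs expand to nothing instead of the literal pattern.
shopt -s nullglob

for hashed_directory in "$master_directory"/* ; do
  for leaf_directory in "$hashed_directory"/* ; do
    for xml_metadata in "$leaf_directory"/*.xml ; do

      # do something interesting
      wc < "$xml_metadata"

    done
  done
done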
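
The same three-level walk can be written as a single glob in Python. This is a sketch, not what I run in production. The function names are mine, and the word count is only a rough analogue of wc.

```python
from pathlib import Path


def xml_files(master):
    """Yield every *.xml file exactly two directory levels below `master`."""
    yield from sorted(Path(master).glob("*/*/*.xml"))


def count_words(path):
    """Rough word count: whitespace-separated tokens in the file."""
    return len(path.read_text(encoding="utf-8").split())


if __name__ == "__main__":
    for xml_metadata in xml_files("/home/alice/work/master"):
        # do something interesting
        print(count_words(xml_metadata), xml_metadata)
```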
