bash – Red Leopard

filenames with spaces

kelly — Fri, 15 Oct 2010 08:48:22 +0000

I downloaded Unity last week. My first bit of installation geekery was to push the documentation to marmaduke, my CentOS server. I drop the documentation directory into a virtual host and let apache serve the index (since I solved my Apache Directory Indexing problem a bit back.)

However, I found that the pages had missing images. Looking at the HTML source, I found that some of the image URLs used ‘images’ (lower case i) while others used ‘Images’ (upper case I). To make matters worse, CSS and javascript files were also stored in the images directory.

I wanted to simply change all occurrences of Images to images. The Unity docs have a lot of files. How many? Just under four thousand.

$ find -name '*html' | wc --lines
3952

Shouldn’t be too hard of a script. However, piping the output of find didn’t work since many filenames contained spaces. Spaces in filenames cause problems.

The bash for loop split the output of find on whitespace, so every filename containing a space turned into several separate words. To solve this, I simply dumped the filenames to a tempfile and read them back in one line at a time. I’m sure there’s a one-liner somewhere but this solved my problem.

Solution: bash, find and perl search and replace over multiple files and the filenames contain spaces.

#!/bin/bash

find . -name '*html' > filenames

# IFS= and -r keep leading whitespace and backslashes in each name intact
while IFS= read -r line; do
  perl -pi -e 's/Images/images/g' "${line}"
done < filenames

rm filenames

bash uuid generator

kelly — Thu, 25 Mar 2010 04:20:50 +0000

One-liner bash scripts are handy, but bash and the common utilities don’t always work the same on the two systems I use most: CentOS and OS X.

centos $ cat /etc/redhat-release 
CentOS release 5.4 (Final)

osx $ sw_vers | head -n2
ProductName:	Mac OS X
ProductVersion:	10.6.2

For example, I recently wrote a simple script to generate a set of UUIDs using the uuidgen utility. The OS X and CentOS versions of uuidgen take very different parameters.

Of course they do.

CentOS uuidgen manpage

UUIDGEN(1)                                                 UUIDGEN(1)

NAME
       uuidgen - command-line utility to create a new UUID value

SYNOPSIS
       uuidgen [ -r | -t ]
  ...

I like to use the uuidgen -r option to explicitly generate a random-based UUID. It’s not strictly necessary as this is the default behavior. Still, I like to put it in. That’s just me. OS X doesn’t have this option. Oh, well.

OS X uuidgen manpage

UUIDGEN(1)           BSD General Commands Manual           UUIDGEN(1)

NAME
     uuidgen -- generates new UUID strings

SYNOPSIS
     uuidgen [-hdr]
  ...

Next up, OS X generates UUIDs in upper case whereas CentOS generates UUIDs in lower case.

centos $ uuidgen
18722f8e-14cd-41fb-a63e-af9ff1c287ce

osx $ uuidgen
81AE9EAC-0B8B-4DB9-B262-76AA8C285DD6

Again, not really a big deal but I like consistency. Easy to fix with a pipe and tr.

osx $ uuidgen | tr '[:upper:]' '[:lower:]'
62a4d6b9-e0a9-4996-9e71-e7291158b700

But I needed a set of UUIDs. A simple loop would suffice.

centos $ for i in `seq 1 4`; do uuidgen | tr '[:upper:]' '[:lower:]'; done
408bf1d7-80a6-41ee-8a75-f7bbb5b65dd7
ae5e0aa4-f0b2-48ff-9cfe-ab99fb37b5c7
7e0a7e69-364d-4259-9b3f-83d448e9b591
e1d1b257-974e-4754-a6d3-fe4566b55c93

osx $ for i in `seq 1 4`; do uuidgen | tr '[:upper:]' '[:lower:]'; done
-bash: seq: command not found

Drat! No `seq 1 4` in OS X.

Okay. Use the alternate form to declare a sequence.

osx $ for i in {1..4}; do uuidgen | tr '[:upper:]' '[:lower:]'; done
c861326b-bde8-4198-b45a-6bfb7016addb
ef813568-5d3d-4587-a170-8aab798fd83b
21fe8562-1511-4fd4-bd37-71b43c32e013
acb10051-9af8-42b8-9ac9-54010ad71d07

And verify that it also works on CentOS.

centos $ for i in {1..4}; do uuidgen | tr '[:upper:]' '[:lower:]'; done
93c68aba-cbe5-4b79-a1cc-e00eaae0527a
c564a4f4-9d39-4d2d-8762-4ba506c97de8
f694000b-d2cc-4b31-aabd-c3facd13b081
86466e00-3948-45f7-9090-09ab816b8fb6
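One catch with {1..4}: brace expansion happens before variables are expanded, so {1..$n} does not work when the count lives in a variable. A rough sketch of a variant that takes the count as an argument, using bash's C-style for loop. The helper names to_lower and repeat_cmd are invented for this example.

```shell
#!/bin/bash

# Lowercase whatever arrives on stdin. The character classes are quoted
# so the shell cannot glob-expand them against files in the current directory.
to_lower() {
  tr '[:upper:]' '[:lower:]'
}

# Run a command N times. N can come from a variable, unlike {1..N}.
repeat_cmd() {
  local n=$1 i
  shift
  for (( i = 0; i < n; i++ )); do
    "$@"
  done
}

# Usage (assumes uuidgen is installed):
#   repeat_cmd 4 uuidgen | to_lower
```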

Would ruby be easier? Probably not for this simple hack.

If I knew ruby better, dropping into irb would be just as easy as bash one-liners. But there would be other problems. For example, “Which ruby?”

centos $ ruby -v
ruby 1.9.1p376 (2009-12-07 revision 26041) [x86_64-linux]

osx $ ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]

bash date tricks

kelly — Sun, 16 Aug 2009 21:14:39 +0000

A quick usage note regarding the date util under bash. I sometimes want to convert between a unix timestamp and a formatted date string. I do it infrequently enough that I forget the syntax. This article is me writing down my notes.

In the following example, I want to get timestamps and date strings for both today and yesterday. Why yesterday’s date? Because I want to get yesterday’s data from Google Analytics’ data API. I’ve seen numerous examples that get the day, month and year, then subtract one from the day and propagate the underflow through the month and year. Blech! If I have today’s timestamp, I simply subtract a day’s worth of seconds from it and voilà, yesterday!

#!/bin/bash

# Generate a current unix timestamp
#
day=$(( `date +%s` ))

# Adjust the timestamp above by 24 hours
#
seconds_in_a_day=$(( 24 * 60 * 60 ))
yesterday=$(( day - seconds_in_a_day ))
echo "timestamps"

echo "      day : ${day}"
echo "yesterday : ${yesterday}"
echo " "

# create a formatted date string (linux)
#
#echo "linux formatted string"
#echo "      day : $( date -d @${day} '+%Y%m%d' )"
#echo "yesterday : $( date -d @${yesterday} '+%Y-%m-%dT%H:%M:%S%Z' )"

# create a formatted date string (bsd/mac)
#
echo "bsd/mac formatted string"
echo "      day : $( date -r ${day} '+%Y%m%d' )"
echo "yesterday : $( date -r ${yesterday} '+%Y-%m-%dT%H:%M:%S%Z' )"

# create a formatted date string (win)
#
# echo "windows formatted string"
# echo "windows? really?"

echo " "
echo " "
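Rather than commenting one variant in and the other out, the script could pick the right flag at run time. A minimal sketch, assuming only GNU date (linux) and BSD date (mac) need to be supported. It relies on GNU date accepting --version while BSD date rejects it. The function name format_epoch is made up for this example.

```shell
#!/bin/bash

# Format a unix timestamp with whichever date(1) is installed.
# Usage: format_epoch SECONDS FORMAT
format_epoch() {
  local ts=$1 fmt=$2
  if date --version >/dev/null 2>&1; then
    date -d "@${ts}" "${fmt}"    # GNU date (linux)
  else
    date -r "${ts}" "${fmt}"     # BSD date (mac)
  fi
}

# Usage:
#   day=$( date +%s )
#   format_epoch "$(( day - 24 * 60 * 60 ))" '+%Y%m%d'
```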

Another example? Okay, let’s say I had a text file, foo, and I wanted to embed a date and get the checksum. Furthermore, I wanted the filename to include the timestamp corresponding to the embedded date. (bsd/mac version)

#!/bin/bash

# scriptname: md5tagger
# 
# generate a timestamp,
# generate output filename 
# copy formatted date string to output file 
# cat original file to output file
# copy the md5 sum to another output file
#
day=$( date +%s )
fname="$1-${day}"

# quote the names so a filename with spaces survives intact
date -r "${day}" >"${fname}"
cat "$1" >>"${fname}"
md5 "${fname}" >"${fname}.md5"

Let’s try it! (bsd/mac version)

$ printf "text to copy which\ncould be important\n" >foo
$ ./md5tagger foo

$ ll foo*
-rw-r--r--  1 kelly  kelly  38 Aug 16 13:50 foo
-rw-r--r--  1 kelly  kelly  67 Aug 16 13:51 foo-1250455870
-rw-r--r--  1 kelly  kelly  56 Aug 16 13:51 foo-1250455870.md5

$ cat foo
text to copy which
could be important

$ cat foo-1250455870
Sun Aug 16 13:51:10 PDT 2009
text to copy which
could be important

$ md5 foo-1250455870; cat foo-1250455870.md5 
MD5 (foo-1250455870) = 5cf4d9f274f05b63dfde5f15659cdeb8
MD5 (foo-1250455870) = 5cf4d9f274f05b63dfde5f15659cdeb8

With linux, you substitute md5sum for md5. Of course.
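If the same script has to run on both, one option is to probe for md5sum first and fall back to md5. A sketch only: md5_of is a hypothetical helper name, and it prints just the hash rather than the "MD5 (file) = ..." line shown above.

```shell
#!/bin/bash

# Print the MD5 hash of a file using whichever tool is installed.
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d' ' -f1    # linux: output is "hash  filename"
  else
    md5 -q "$1"                    # bsd/mac: -q prints only the hash
  fi
}

# Usage:
#   md5_of "${fname}" >"${fname}.md5"
```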

use curl for api documentation

kelly — Thu, 02 Apr 2009 18:08:03 +0000

I’ve been working quite a bit with the rest plugin for Struts2. The really nice thing about this plugin is the way it cleans up Struts URLs. Makes them more rails-like. I chuckled when depressed programmer suggested that struts2 is “WebWork on drugs.” I hate struts2. I really do.

Anyway, I have stripped down an AccountController to show just the POST service. In reality, the create() method is wired to a middle tier service that authenticates username, password pairs then updates session attributes with member id and other bits of persistent session data I need.

// imports omitted

@Results({
  @Result(
    name = "success",
    type = ServletActionRedirectResult.class,
    value = "account")
})
public class AccountController extends ActionSupport
{
  private String username;
  private String password;
  // getters/setters omitted

  public AccountController() { }

  public HttpHeaders index()   { return notImplemented(); }
  public HttpHeaders show()    { return notImplemented(); }
  public HttpHeaders edit()    { return notImplemented(); }
  public HttpHeaders editNew() { return notImplemented(); }
  public HttpHeaders update()  { return notImplemented(); }
  public HttpHeaders destroy() { return notImplemented(); }
  public HttpHeaders create()
  {
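    // constant-first equals() avoids a NullPointerException
    // when a form field is missing from the request
    int status = ("alice".equals(username)
           && "restaurant".equals(password))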
      ? HttpServletResponse.SC_ACCEPTED
      : HttpServletResponse.SC_UNAUTHORIZED;

    return new DefaultHttpHeaders().withStatus(status);
  }

  private DefaultHttpHeaders notImplemented()
  {
    return new DefaultHttpHeaders()
      .withStatus(HttpServletResponse.SC_NOT_IMPLEMENTED);
  }

}

Note that I only return HTTP headers; the body content will always be empty.

I have found curl invaluable for documenting the API. This is a simple case but consider a much more complicated system with dozens of URLs and each URL implements many of the HTTP methods (including PUT and DELETE).

Third party developers are the bane of the support engineer. First, few people read documentation. They skim the material and furiously code. When their software fails, they file a bug that the API is broken. Usually, the API isn’t broken; the developer simply did not understand the API.

I subscribe to the agile manifesto value of “working software over comprehensive documentation.” In my work, I have found that a few curl examples clear up most of these issues. For example, to exercise the create() method in the AccountController, simply post a form.

curl                                    \
  --request POST                        \
  --include                             \
  --url "http://ws.example.com/account" \
  --form "username=alice"               \
  --form "password=restaurant"          \
  --cookie-jar "cookies"                \
  --cookie "cookies"

I like to add the “--include” flag as it prints the HTTP response headers along with the body. When I get a support call, I have the developer trot out the “documentation” curl examples and open a bash shell. This, of course, drives the Windows guys nuts, to which I reply, “buck up.” We work through the exercise of getting the HTTP request working with the curl example. Then a miracle occurs. The developer now has a working example on their machine from which to re-examine their code.

A final note. The “--cookie-jar” and “--cookie” parameters handle cookies between the web server and your curl commands. In other words, you can log in to a website and these parameters will store your authenticated session id in a file. The file in this example is named “cookies” but it can be any legal filename. You can then make subsequent calls to URLs, passing the cookies (and, therefore, the session id) back up to the server.

For example, to upload your avatar picture to your new social network, first login using the curl command above. This establishes an authenticated session. Then post your picture using the curl command below, making sure you pass the cookies back up.

curl                                   \
  --request POST                       \
  --include                            \
  --url "http://ws.example.com/avatar" \
  --form "avatar=@somepix.jpg"         \
  --cookie-jar "cookies"               \
  --cookie "cookies"
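The two steps could also be wrapped up as a pair of functions. This is a sketch, not part of the API above: the function names and the CURL, BASE_URL and JAR variables are invented for the example. Setting CURL=echo gives a dry run that prints the arguments instead of hitting the server.

```shell
#!/bin/bash

# Hypothetical settings; override them in the environment if needed.
CURL=${CURL:-curl}
BASE_URL=${BASE_URL:-http://ws.example.com}
JAR=${JAR:-cookies}

# Step 1: log in. The session cookie is written to the jar.
login() {
  "$CURL" --request POST --include \
    --url "${BASE_URL}/account" \
    --form "username=$1" \
    --form "password=$2" \
    --cookie-jar "$JAR" --cookie "$JAR"
}

# Step 2: upload a picture, passing the saved cookies back up.
upload_avatar() {
  "$CURL" --request POST --include \
    --url "${BASE_URL}/avatar" \
    --form "avatar=@$1" \
    --cookie-jar "$JAR" --cookie "$JAR"
}

# Usage:
#   login alice restaurant && upload_avatar somepix.jpg
```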

Finally, if you need to add a description, publish the curl command as part of a bash script. For example,

#!/bin/bash

# 1. you must login before you can upload the avatar
# 2. the web server will reject any avatar exceeding 2MB
# 3. do not forget the '@' symbol, a common mistake
# 4. do not forget to include --cookie and --cookie-jar

curl                                   \
  --request POST                       \
  --include                            \
  --url "http://ws.example.com/avatar" \
  --form "avatar=@somepix.jpg"         \
  --cookie-jar "cookies"               \
  --cookie "cookies"

Good luck!

bash progress monitor

kelly — Sat, 10 Jan 2009 05:55:28 +0000

I have a remote machine that is used to store and process XML files. Recently, I had need to duplicate a directory of XML files (e.g., cp -r a b). It’s not really germane to the subject here, but this particular server has a whack configuration and I gotta rant before I continue.

The office server (scrappy) has pretty good specs.

[scrappy ~]$ cat /proc/meminfo

MemTotal:      3980800 kB

[scrappy ~]$ cat /proc/cpuinfo

processor   : 0
model name  : Intel(R) Core(TM)2 CPU   6600  @ 2.40GHz
cpu MHz     : 2394.000
cache size  : 4096 KB

processor   : 1
model name  : Intel(R) Core(TM)2 CPU   6600  @ 2.40GHz
cpu MHz     : 2394.000
cache size  : 4096 KB

[scrappy ~]$ cat /proc/scsi/scsi

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: SONY   Model: DVD RW AW-Q170A    Rev: 1.72
  Type:   CD-ROM                           ANSI SCSI revision: 05

[scrappy ~]$ cat /proc/ide/hd?/model

ST3320620AS

Whoa! What’s my SATA drive doing attached to the IDE driver? When I compare to my home CentOS box (marmaduke), I see that its drives are connected differently. Yes, marmaduke has one HDD connected via the IDE driver (ST3320620A) but that drive is a PATA drive. The four SATA drives are connected via SATA drivers. (The SATA drives will be configured as a software RAID 10, stay tuned. There’s a xen project in the making.)

[marmaduke ~]$ cat /proc/scsi/scsi

Attached devices:
Host: scsi2 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3500630AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3300620AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3300620AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: ST3300620AS      Rev: 3.AA
  Type:   Direct-Access                    ANSI SCSI revision: 05

[marmaduke ~]$ cat /proc/ide/hd?/model

PIONEER DVD-RW DVR-111D
ST3320620A

scrappy was configured before arriving at the office by a friend of a friend who runs a PC shop. “But it was such a deal!” Yeah, right. Bunch of monkeys. How hard is it to configure the BIOS to use the SATA interface rather than the IDE interface?

Anyway, I don’t have time to rebuild scrappy right now so I live with the dismal disk performance. Here’s the problem at hand. I have numerous XML files—some largish and some smallish. I have several sets and each set has about 4000 files.

[scrappy ~]$ ls src/*xml | wc -w

4323

[scrappy ~]$ ls -l src/*xml | sort -n -r -k5

-rw-r--r-- 1 kelly kelly 315804120 Dec 19 15:46 0001.xml
-rw-r--r-- 1 kelly kelly 275651475 Dec 19 17:34 0002.xml
-rw-r--r-- 1 kelly kelly 260250994 Dec 19 16:15 0003.xml
-rw-r--r-- 1 kelly kelly 222402294 Dec 19 16:25 0004.xml
-rw-r--r-- 1 kelly kelly 204642813 Dec 19 15:52 0005.xml
     .
     .
     .
-rw-r--r-- 1 kelly kelly      1467 Dec 19 19:15 4321.xml
-rw-r--r-- 1 kelly kelly      1467 Dec 19 16:01 4322.xml
-rw-r--r-- 1 kelly kelly      1098 Dec 19 19:19 4323.xml

I wanted to duplicate the set of files as I needed to run some prototype code that I didn’t trust to be non-destructive. Simple.

[scrappy ~]$ cp -r src tgt

However, the disk performance is agonizing. So bad that I leave it while I work on another machine. But I want to know the progress and see it as it changes. With six to ten shells open, I want something that can be resized to use minimal screen real estate. I want a quick command line progress monitor.

bash to the rescue. I didn’t want to create a script file so I just jack it right into the terminal’s command line. When you open the while loop, bash will continue on the next line until you close it with the done keyword.

[scrappy ~]$ while 'true'; do
>   ts=`date`
>   src=`ls src/*xml 2>/dev/null | wc -w`
>   tgt=`ls tgt/*xml 2>/dev/null | wc -w`
>   echo -ne "  ${ts}  ${src}  ${tgt}        \r"
>   sleep 1
> done

  Fri Jan  9 15:20:17 PST 2009  4323  2304

Recall we’ve previously covered that 2>/dev/null hides the error message generated by ls if no file is found.

The components are stored in local variables as a matter of convenience and displayed using echo.

echo is passed two switches. The -n switch suppresses the trailing newline so that the cursor remains on the same line as the displayed text. The -e switch causes backslashes in the text to be interpreted as the escape character. This is useful since I want to add a trailing carriage return character. This will push the cursor to the beginning of the line while remaining on the same line as the text.

After sleeping for one second, the script generates a new echo output which overwrites the old text. I suppose I could add a test to the script to break when ${src} equals ${tgt}.
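Adding that test might look something like the sketch below. The function names count_xml and monitor_copy are made up, and it uses printf in place of echo -ne. Note that matching counts only mean every file has been created in the target; the last file may still be mid-copy.

```shell
#!/bin/bash

# Count the *xml files in a directory (prints 0 if there are none).
count_xml() {
  ls "$1"/*xml 2>/dev/null | wc -l | tr -d ' '
}

# Redraw a one-line progress display once a second until the target
# directory holds as many xml files as the source directory.
monitor_copy() {
  local src_dir=$1 tgt_dir=$2 src tgt
  while true; do
    src=$(count_xml "$src_dir")
    tgt=$(count_xml "$tgt_dir")
    printf '  %s  %s  %s        \r' "$(date)" "$src" "$tgt"
    if [ "$src" -eq "$tgt" ]; then
      break
    fi
    sleep 1
  done
  printf '\n'    # leave the final status line on screen
}

# Usage:
#   monitor_copy src tgt
```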

I don’t know why disk I/O is so slow on scrappy. Perhaps the mode is set to use programmed I/O rather than DMA. Who knows? Who cares? Both scrappy and marmaduke have Intel ICH8 SATA controllers. scrappy has a faster processor with more cache. Yet, marmaduke smokes on disk throughput on either the SATA or IDE drives. Something is wonky.

I’d like to say that I can ignore the issue. I have way too much going on right now. But it bugs me.

grep and UTF-8

kelly — Wed, 24 Dec 2008 01:44:24 +0000

I needed to look up the various strings Apple uses to name the iTunes Library. First I tried to get the name from the iTunes resource bundle.

echo "this won't work..."
echo "so don't even try it"

cd /Applications/iTunes.app/Contents/Resources/English.lproj
cat Localizable.strings | grep 'PrimaryPlaylistName'

But I quickly learned that grep doesn’t work on the strings file. Why? Because Apple strings files are usually not UTF-8. They are usually UTF-16, and in this case they are. I wanted to iterate over the set of resource strings and extract just the string I wanted.

First I had to convert the file from UTF-16 to another format. Really, the only format that makes sense is UTF-8. After a bit of trial and error, I finally had my script just so.

#!/bin/bash

cd /Applications/iTunes.app/Contents/Resources/

# file to look inside of
f='Localizable.strings'

# string to search for
s='PrimaryPlaylistName'

# look for directories of the form *.lproj
#
for d in *.lproj ; do
  echo -n "${d}: "
  iconv -f UTF-16 -t UTF-8 "${d}/${f}" | grep "${s}"
done

That’s it. That’s the script. Slap that puppy in a file (e.g., foo) and fire it off.

$ ./foo
Dutch.lproj: "kMusicLibraryPrimaryPlaylistName" = "Bibliotheek";
English.lproj: "kMusicLibraryPrimaryPlaylistName" = "Library";
French.lproj: "kMusicLibraryPrimaryPlaylistName" = "Bibliothèque";
German.lproj: "kMusicLibraryPrimaryPlaylistName" = "Mediathek";
Italian.lproj: "kMusicLibraryPrimaryPlaylistName" = "Libreria";
Japanese.lproj: "kMusicLibraryPrimaryPlaylistName" = "ライブラリ";
Spanish.lproj: "kMusicLibraryPrimaryPlaylistName" = "Biblioteca";
da.lproj: "kMusicLibraryPrimaryPlaylistName" = "Bibliotek";
fi.lproj: "kMusicLibraryPrimaryPlaylistName" = "Kirjasto";
ko.lproj: "kMusicLibraryPrimaryPlaylistName" = "보관함";
no.lproj: "kMusicLibraryPrimaryPlaylistName" = "Bibliotek";
pl.lproj: "kMusicLibraryPrimaryPlaylistName" = "Biblioteka";
pt.lproj: "kMusicLibraryPrimaryPlaylistName" = "Biblioteca";
pt_PT.lproj: "kMusicLibraryPrimaryPlaylistName" = "Biblioteca";
ru.lproj: "kMusicLibraryPrimaryPlaylistName" = "Медиатека";
sv.lproj: "kMusicLibraryPrimaryPlaylistName" = "Bibliotek";
zh_CN.lproj: "kMusicLibraryPrimaryPlaylistName" = "资料库";
zh_TW.lproj: "kMusicLibraryPrimaryPlaylistName" = "資料庫";

If I were better at command line perl, I could make a nice formatted table. That is, if I were better at perl.
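For what it’s worth, plain bash can get most of the way to a table without perl. A rough sketch, assuming every input line has the same shape as the output above (name.lproj: "key" = "value";). The function name format_table and the column width of 10 are arbitrary choices for this example.

```shell
#!/bin/bash

# Read lines like
#   Dutch.lproj: "kMusicLibraryPrimaryPlaylistName" = "Bibliotheek";
# and print two columns: the language and the translated string.
format_table() {
  local line lang value
  while IFS= read -r line; do
    lang=${line%%.lproj:*}     # drop everything from ".lproj:" onward
    value=${line##*= \"}       # drop everything up to the opening quote
    value=${value%\";}         # drop the closing quote and semicolon
    printf '%-10s %s\n' "$lang" "$value"
  done
}

# Usage:
#   ./foo | format_table
```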

centos l10n problem

kelly — Thu, 18 Dec 2008 02:46:03 +0000

Just about the time I believe the UTF-8 beast is in the cage, it escapes and runs amok.

This AM, I started to deploy an update to the webapp on EC2. Seems that some of the static strings in the app contained UTF-8 encoded non-ascii characters. The java compiler barfed. “The heck?”, I thought. I just compiled the app on my MacBook. I checked the usual suspects (tomcat’s server.xml, JAVA_OPTS) but everything looked fine. However, when I looked at the code, it was indeed mangled.

Crap! Was this a bug in CVS? (Yes, we still use CVS.) Wait. What if I cut and paste the correct code from my Mac to the CentOS server version? No luck. Couldn’t be vi. Trusty old vi. Could it be that CentOS is confused? Let’s look:

$ locale
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

What the…?

I don’t know what I did but when I created my EC2 image, I must have omitted a step. None of the googled web-geniuses had solved this exact problem but it seems everyone flails about with the LANG environment variable.

export LANG=en_US.UTF-8

That did the trick!

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

A fresh cvs checkout and I was back in business. I don’t feel I completely understand Centos localization configuration. At least I’m aware of it, now.
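To catch this earlier next time, a build or deploy script could check the character encoding up front. A minimal sketch using `locale charmap`, which prints the encoding of the current locale. The function name require_utf8 is made up for this example.

```shell
#!/bin/bash

# Return non-zero (with a message on stderr) unless the current
# locale uses the UTF-8 character encoding.
require_utf8() {
  local charmap
  charmap=$(locale charmap 2>/dev/null)
  if [ "$charmap" != "UTF-8" ]; then
    echo "locale charmap is '${charmap}', expected UTF-8" >&2
    return 1
  fi
}

# Usage:
#   require_utf8 || exit 1
```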

bash array crawler

kelly — Thu, 18 Dec 2008 02:18:06 +0000

I wanted to complement my bash directory crawler post with a bash array crawler example.

Sometimes, it’s easier to jack a list of identifying tokens into an array and process them rather than to build an end-to-end script with database access. For this contrived example, I grab a list of UUIDs from MySQL with a simple SQL statement.

mysql> SELECT id, uuid FROM icons;
+-----+--------------------------------------+
| id  | uuid                                 |
+-----+--------------------------------------+
|   1 | fe0b16ed-3369-4dda-8e60-faffb966375d |
|   3 | 82bfcbc2-84a2-4ca7-914b-13172b94feb6 |
|   6 | ab5e7265-3698-4205-b081-e6aec528fee2 |
|  11 | 4b6ca26b-c6ed-494f-aeb4-9bf369e2d465 |
|  19 | e7cc807b-7f15-46fa-b1c5-85d1f1050155 |
+-----+--------------------------------------+
5 rows in set (0.00 sec)

Next, jack the tokens into an array and simply crawl over the tokens.

#!/bin/bash

uuids=(
 fe0b16ed-3369-4dda-8e60-faffb966375d
 82bfcbc2-84a2-4ca7-914b-13172b94feb6
 ab5e7265-3698-4205-b081-e6aec528fee2
 4b6ca26b-c6ed-494f-aeb4-9bf369e2d465
 e7cc807b-7f15-46fa-b1c5-85d1f1050155
)

for uuid in "${uuids[@]}" ; do

  # do something interesting here
  echo "http://icons.example.com/${uuid}.jpg"

  # curl \
  #   --request GET \
  #   --remote-name \
  #   --url "http://icons.example.com/${uuid}.jpg"

done
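When the list gets too long to paste by hand, the array can be filled from a file with one token per line. A sketch under a few assumptions: the function name load_tokens and the file uuids.txt are invented, and it avoids mapfile because that needs bash 4, which OS X does not ship.

```shell
#!/bin/bash

# Read one token per line from stdin into the global array "tokens",
# skipping empty lines.
load_tokens() {
  local line
  tokens=()
  while IFS= read -r line; do
    if [ -n "$line" ]; then
      tokens+=("$line")
    fi
  done
}

# Usage (uuids.txt is a hypothetical file, e.g. produced with
# mysql --batch --skip-column-names -e "SELECT uuid FROM icons"):
#   load_tokens < uuids.txt
#   for uuid in "${tokens[@]}" ; do echo "${uuid}" ; done
```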

EC2 and S3 Success Story

kelly — Thu, 11 Dec 2008 02:09:36 +0000

I’ve been building systems lately on Amazon’s Elastic Compute Cloud (EC2). At first, I was only interested in Amazon’s Simple Storage Service (S3) after seeing the SmugMug slide show.

I hadn’t really considered using EC2 since we had more servers in colocation than I really needed. But I had a file storage problem. When you have a thousand files, you stick them in a directory. When you have a million files, you cannot simply stick them in a single directory. You distribute them across multiple directories. What a PITA.

My first thought was to use MogileFS. It handles the directory hashing for you and distributes redundant copies of files across multiple servers. I had extra servers. Sweet. But before I rushed off and started building my shiny new filesystem, I wanted to check out the competitors. That led me to SmugMug. And that led me to S3.

I work at a tiny startup. I had a problem and very few developers to ask for help. Every hour I needed from them was a significant impact on another project. And dammit, all the open projects were on fire. I needed to solve my file system problem and fast.

So up on S3 the files went. XML files. Beaucoup XML files.

It was painless. It was simple. It was cheap. The monthly S3 cost is a fraction of a server’s cost in colocation. Sweet!

Wait! If that’s so yummy, why not move XML processing up to EC2? Our XML processing load was increasing…increasingly increasing. I rewrote our XML processing app, built a custom amazon machine image (centos + apache + tomcat) and fired it up. Nice!

Building the machine instance was a pain but worth the effort. I learned a lot about centos that I didn’t previously know or really understand. However, I wish I had a real system administrator on staff. It would have hurt less.

One of the goals for the EC2-based XML processing was to shift from offline XML processing to a RESTful web service. That is, rather than queue the XML processing in a single process, I needed to finish the XML processing during the HTTP request. On demand processing. Done in seconds (not tens of minutes). And handle multiple concurrent processing requests.

Here is the EC2 <--> S3 connection. For each file received for processing, I write dozens to hundreds of files to S3 plus open scads of HTTP connections to other web servers. Running these in a single thread burned precious time. Even though we “write” to S3, the underlying mechanism is another HTTP request.

Simple. Build a thread pool for the HTTP requests and run multiple threads concurrently. That worked swimmingly but for one issue. It didn’t take long until I started seeing the “Too many open files” in the exception logs.

Normally, the limit on open files is quite adequate. But you bolt Apache’s HttpClient to the backend of your webapp and supercharge it with a healthy thread pool and you will overwhelm the default settings. Centos will not “garbage collect” the spent files from completed HTTP requests fast enough.

The solution: Up the limits on open files. The default is 1024. Simply edit /etc/security/limits.conf and change the soft and hard values for nofile. I’m sure there is a maximum size but these values have been working for me. What’s appropriate for your system is dependent on your system. You will need to pick size values for yourself.

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4
*                soft    nofile          8192
*                hard    nofile          65536
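The limits in limits.conf are applied at login, so the new values only show up in a fresh session. A quick sketch for checking what a shell actually got, using bash's built-in ulimit. The function name show_nofile_limits is made up for this example.

```shell
#!/bin/bash

# Print the soft and hard limits on open files for the current shell.
show_nofile_limits() {
  printf 'soft nofile: %s\n' "$(ulimit -Sn)"
  printf 'hard nofile: %s\n' "$(ulimit -Hn)"
}

# Usage: log out, log back in, then run
#   show_nofile_limits
```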

What was the net result of moving XML processing and storage up to the Amazon Cloud? Retired 60% of the servers in colocation. Built a scalable infrastructure. Reduced overall monthly hosting costs. Fewer moving parts.

Now, if only I had a system administrator…

bash directory crawler

kelly — Thu, 04 Dec 2008 19:44:21 +0000

Currently, popular filesystems (ext3, hfs+) have a practical limit on the number of files and directories you can store in a single directory. Certainly, most of the unix command line tools will not work once you exceed some magic threshold. In my experience, 10,000 files and/or directories is the practical limit.

So what do you do when you have 1,000,000 XML files to process? I had this very problem recently. Fortunately, the problem was simplified as each file belongs to one of 27,000 categories.

I organized my hierarchy into three directory levels with all the xml files in the lowest level. I then use bash to traverse the directories.

master/
  |
  +-- 0/
  |   |
  |   +-- 0/
  |   |   |
  |   |   +-- f494a6f9-fc57-4408-a637-d3b768d0cd99.xml
  |   |   |
  |   |   +-- 5be1a5ed-f159-41d1-bc2e-737b5d2bed8b.xml
  |   |   |
  |   |   +-- a4276d0f-a014-42c2-a5ec-dbf59dfee95a.xml
  |   ⋮
  |   +-- 9999/
  |
  +-- 1/
  |   |
  |   +-- 10000/
  |   ⋮
  |   +-- 19999/
  |
  +-- 2/
      |
      +-- 20000/
      ⋮
      +-- 26999/

In my problem space, I am guaranteed that each leaf directory has at least one and at most a few hundred xml files. The following script is in production use with the one exception that I’m doing more than simply counting words.

#!/bin/bash

cd /home/alice/work/master
master_directory=`pwd`

for hashed_directory in "$master_directory"/* ; do
  for leaf_directory in "$hashed_directory"/* ; do
    for xml_metadata in "$leaf_directory"/*.xml ; do

      # do something interesting
      wc < "$xml_metadata"

    done
  done
done
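Going the other way, from a category number to its leaf directory, is one integer division. This sketch is inferred from the tree above, where each top-level directory holds 10,000 categories. The function name leaf_directory_for is made up, and it assumes category numbers are written without leading zeros.

```shell
#!/bin/bash

# Map a category number to its path relative to master/.
# The top-level directory is the category divided by 10000 (integer math).
leaf_directory_for() {
  local category=$1
  printf '%s/%s\n' "$(( category / 10000 ))" "$category"
}

# Usage:
#   cd "/home/alice/work/master/$(leaf_directory_for 12345)"
```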