bash directory crawler

Currently, popular filesystems (ext3, HFS+) have a practical limit on the number of files and directories you can store in a single directory. Certainly, most of the unix command-line tools will not work once you exceed some magic threshold. In my experience, 10,000 files or directories is the practical limit.
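The usual failure is not the filesystem itself but the kernel's argument-length limit: a glob that expands to hundreds of thousands of names can no longer be passed to an external command. A hypothetical session:

$ cd /some/huge/directory
$ ls *.xml
bash: /bin/ls: Argument list too long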

So what do you do when you have 1,000,000 XML files to process? I had this very problem recently. Fortunately, the problem was simplified by the fact that each file belonged to one of 27,000 categories.

I organized my hierarchy into three directory levels, with all the XML files at the lowest level, and then used bash to traverse the directories. The tree below shows the layout; a short sketch of the ID-to-path mapping follows it.

master/
  |
  +-- 0/
  |   |
  |   +-- 0/
  |   |   |
  |   |   +-- f494a6f9-fc57-4408-a637-d3b768d0cd99.xml
  |   |   |
  |   |   +-- 5be1a5ed-f159-41d1-bc2e-737b5d2bed8b.xml
  |   |   |
  |   |   +-- a4276d0f-a014-42c2-a5ec-dbf59dfee95a.xml
  |   ⋮
  |   +-- 9999/
  |
  +-- 1/
  |   |
  |   +-- 10000/
  |   ⋮
  |   +-- 19999/
  |
  +-- 2/
      |
      +-- 20000/
      ⋮
      +-- 26999/
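Assuming the numbering shown above (the top-level bucket is the category ID divided by 10,000, and the leaf keeps the full ID), the leaf path for a category can be derived with shell arithmetic. The helper name is mine, purely for illustration:

# Hypothetical helper: map a category ID (0-26999) to its leaf directory.
category_path() {
  local id=$1
  printf 'master/%d/%d\n' $(( id / 10000 )) "$id"
}

category_path 12345   # prints master/1/12345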

In my problem space, I am guaranteed that each leaf directory has at least one and at most a few hundred XML files. The following script is in production use, with the one exception that I’m doing more than simply counting words.

#!/bin/bash

cd /home/alice/work/master || exit 1
master_directory=$(pwd)

# level 1: hashed buckets (0, 1, 2)
for hashed_directory in "$master_directory"/* ; do
  # level 2: one directory per category
  for leaf_directory in "$hashed_directory"/* ; do
    # level 3: the XML files themselves
    for xml_metadata in "$leaf_directory"/*.xml ; do

      # do something interesting
      wc < "$xml_metadata"

    done
  done
done
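If the directory depth ever changes, a find-based variant avoids the nested globs entirely. This is only a sketch, not the production script above:

# Same traversal with find; -print0 / read -d '' keeps unusual
# filenames intact.
find /home/alice/work/master -type f -name '*.xml' -print0 |
while IFS= read -r -d '' xml_metadata ; do
  wc < "$xml_metadata"
done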
