I used the following command line for the RandomTextWriter job:
./hadoop jar ../hadoop-0.21.0-dev-*examples.jar randomtextwriter -D test.randomtextwrite.total_bytes=53687091200000 -D test.randomtextwrite.bytes_per_map=536870912 -D test.randomtextwrite.min_words_key=5 -D test.randomtextwrite.max_words_key=10 -D test.randomtextwrite.min_words_value=100 -D test.randomtextwrite.max_words_value=10000 -D mapred.output.compress=false -D mapred.map.output.compression.type=BLOCK -outFormat org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat /gridmix/data/WebSimulationBlock
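As a quick sanity check, the -D byte counts above imply the job sizing described below; a minimal shell arithmetic check (pure arithmetic on the literal values from the command line, no Hadoop needed):

```shell
# Sizing implied by the flags above.
total_bytes=53687091200000      # test.randomtextwrite.total_bytes
bytes_per_map=536870912         # test.randomtextwrite.bytes_per_map (512 MB)

echo $(( total_bytes / bytes_per_map ))    # number of map tasks → 100000
echo $(( total_bytes / 1000000000000 ))    # total output in TB → 53
```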
The job has 100,000 maps and no reduces. I configured HDFS with a replication factor of 1 to eliminate network traffic. Each node was configured with 16 map slots and 2 reduce slots, and each task was given at most 512 MB of Java heap. The job's output is ~50 TB, far larger than the overall cluster memory, which forces the writes to disk.
Since the job does no computation other than writing to disk, one would expect it to be entirely I/O (disk) bound. Surprisingly, it turned out to be CPU bound.
Measurements (using Chukwa):
Across the cluster, worker CPUs were <5% idle on average. Disk bandwidth in use was ~40 MB/s across all disks for every node in the cluster, which is close to the practical disk bandwidth limit. The network was virtually 100% idle, as one would expect.
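To see why the CPU can still be the bottleneck even near the disk bandwidth limit, a rough back-of-the-envelope helps (assuming the ~40 MB/s figure is the aggregate write bandwidth per node, as the measurement suggests):

```shell
# Per-map-slot share of disk bandwidth (assumption: ~40 MB/s aggregate per node).
node_bw_mb=40      # measured aggregate disk bandwidth per node (MB/s)
map_slots=16       # map slots per node
map_output_mb=512  # bytes_per_map = 512 MB of output per map

echo $(( node_bw_mb / map_slots ))             # ~2 MB/s written per slot
echo $(( map_output_mb * map_slots / node_bw_mb ))  # ~204 s to drain one wave of maps
```

Each map only needs to push ~2-3 MB/s to disk, so if the CPUs are <5% idle at that rate, the cycles are going into the map-task code path, not into waiting for I/O.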
About 70% of the CPU time was spent in user space, so the cost is not in the kernel or the I/O path. This suggests there is a lot of CPU fat in the map tasks themselves.
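One way to locate that CPU fat would be to rerun a handful of maps with Hadoop's built-in HPROF profiling hooks. A sketch, assuming the stock mapred.task.profile.* knobs (the HPROF option string below is the usual default, quoted from memory, not something captured from the run above):

```shell
# Profile the first few map tasks with HPROF (sketch; "..." stands for the
# same randomtextwriter arguments as in the original command line).
# Profiler output is written alongside the task logs.
./hadoop jar ../hadoop-0.21.0-dev-*examples.jar randomtextwriter \
  -D mapred.task.profile=true \
  -D mapred.task.profile.maps=0-2 \
  -D mapred.task.profile.params="-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s" \
  ...
```

The cpu=samples output should show which map-side methods (serialization, text generation, output collection) are burning the cycles.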