The MR job is working tremendously well for me. I can almost instantly saturate my entire cluster during an upload, and it stays saturated until the end: full CPU usage and lots of io-wait, so I'm disk-IO-bound, as I should be.
I did a few runs of a job that imported between 1M and 10M rows, each row containing a random number of columns between 1 and 1000. In the end, I imported between 500M and 5B KeyValues.
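For anyone checking the KeyValue totals, they fall straight out of the row counts: with column counts uniform on [1, 1000], the average is ~500 columns (KeyValues) per row. A quick back-of-the-envelope in Python (none of these numbers are new, this is just the arithmetic):

```python
# Expected KeyValue counts for the import job described above.
# Assumes column counts per row are uniform on [1, 1000], i.e. ~500.5 on average.
avg_cols = (1 + 1000) / 2  # 500.5 KeyValues per row, on average

for rows in (1_000_000, 10_000_000):
    expected_kvs = rows * avg_cols
    print(f"{rows:,} rows -> ~{expected_kvs:,.0f} KeyValues")
# 1,000,000 rows -> ~500,500,000 KeyValues  (the "500M" run)
# 10,000,000 rows -> ~5,005,000,000 KeyValues  (the "5B" run)
```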
On a 5-node cluster of 2-core/2GB/250GB nodes, I could import 1M rows / 500M keys in 7.5 minutes (2.2k rows/sec, 1.1M keys/sec).
On a 10-node cluster of 4-core/4GB/500GB nodes, I could do the same import in 2.5 minutes. On this larger cluster I also ran the same job with 10M rows / 5B keys, which finished in 25 minutes (6.6k rows/sec, 3.3M keys/sec).
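The throughput figures above are just the counts divided by wall-clock seconds; showing the arithmetic in Python for convenience:

```python
# Throughput math behind the figures above: count / wall-clock seconds.
def throughput_per_sec(count, minutes):
    return count / (minutes * 60)

# 5-node cluster: 1M rows / 500M keys in 7.5 minutes
print(round(throughput_per_sec(1_000_000, 7.5)))     # 2222   (~2.2k rows/sec)
print(round(throughput_per_sec(500_000_000, 7.5)))   # 1111111 (~1.1M keys/sec)

# 10-node cluster: 10M rows / 5B keys in 25 minutes
print(round(throughput_per_sec(10_000_000, 25)))     # 6667   (~6.6k rows/sec)
print(round(throughput_per_sec(5_000_000_000, 25)))  # 3333333 (~3.3M keys/sec)
```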
With my previous HTable-based imports on these clusters, I was seeing between 100k and 200k keys/sec, so this represents a 5-15X speed improvement. On top of that, the imports finish without any problem (running these imports through the normal HBase write path would have killed the little cluster).
I think there is a bug in the ruby script, though. It worked sometimes, but other times it ended up hosing the cluster until I restarted it; things worked fine after the restart.