Hi Vinod, I will give you some numbers, but bear in mind that these results are very preliminary: they are based on only a handful of runs on a 9-to-10-machine cluster, without any serious tuning of terasort.
The idea of the solution is for maps to write their output directly into HDFS (e.g., with replication turned down to 1). Reducers are started only once the maps complete, and they stream-merge straight out of HDFS (bypassing much of the partial-merging logic).
Key limitations of what we have for now:
1) If a map output is lost, all reducers have to wait for that map to be re-run.
2) We keep a lot of DFSClients open, which might become a problem for HDFS if you have too many maps per node.
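To make the reduce side concrete, here is a minimal sketch of the stream-merge idea, not the actual Hadoop code: each map output is an already-sorted stream (an in-memory list standing in for a partition file opened through a DFSClient), and the reducer merges the streams lazily without materializing a merged run on local disk. The function names and data are made up for illustration.

```python
import heapq

def open_map_output(records):
    # Stand-in for opening one map's sorted partition file in HDFS
    # and iterating its (key, value) records.
    return iter(records)

def stream_merge(map_outputs):
    # heapq.merge consumes the sorted streams lazily, so the reducer
    # sees a single globally sorted stream with no local merge pass
    # and no intermediate data touching the reduce-side disk.
    streams = [open_map_output(out) for out in map_outputs]
    return heapq.merge(*streams)

# Three sorted "map outputs" for one reduce partition (hypothetical data).
outputs = [
    [("a", 1), ("d", 4)],
    [("b", 2), ("e", 5)],
    [("c", 3), ("f", 6)],
]
merged = list(stream_merge(outputs))
```

The point of the sketch is only that the merge is streaming: nothing is buffered beyond one head record per input stream.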
We initially tried this as a way to make checkpointing cheaper (no state needs to be saved other than the last-processed key), and we were just hoping for it not to be too much worse than the regular shuffle. The surprise I mentioned above is that we actually observed a substantial speed-up on a simple sort job (on 9 nodes): 25% at 64GB scale and 31% at 1TB scale.
This seems to indicate that the penalty of reading through HDFS is actually trumped by the benefits of doing a stream merge (where data never touches disk on the reduce side, other than for the reducer's output). Most likely this reduces seeks and uses the drives we read from and write to more efficiently. You could imagine getting similar benefits by adding restartability to the HTTP client (plus the buffering done by the HDFS client, which was probably beneficial in our test). More sophisticated versions of this could also dynamically decide whether to stream-merge from a certain map or to copy its data (if, for example, it is small enough to fit in memory).
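The dynamic decision mentioned above could be as simple as a per-map-output plan based on size; here is a hypothetical sketch (the threshold, names, and sizes are all made up, and a real version would also have to account for available reducer memory):

```python
# Made-up threshold below which copying an output into memory is
# assumed cheaper than holding a stream open to it.
SMALL_OUTPUT_BYTES = 64 * 1024

def plan_fetch(output_sizes):
    # For each map output, "copy" small ones into reducer memory and
    # "stream" the large ones straight out of HDFS during the merge.
    return {
        map_id: ("copy" if size <= SMALL_OUTPUT_BYTES else "stream")
        for map_id, size in output_sizes.items()
    }

plan = plan_fetch({"map_0": 4_096, "map_1": 512 * 1024 * 1024, "map_2": 1_024})
```

A side benefit of copying the small outputs is that it would also reduce the number of DFSClients held open per reducer, which addresses limitation 2) above.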
Bottom line, I don't think we should read too much into these results (again, very preliminary), other than that using HDFS as the intermediate-data layer is not completely infeasible.