Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 0.20.1
- Component/s: None
- Labels: None
Description
HADOOP-3638 improved the merge of the map output by caching the index files. But why not also cache the data files?
In our use-case scenario (still on hadoop-0.18.3, so HADOOP-3638 would only help partially), an individual map task finishes in less than 30 minutes, but needs 4 hours to merge 70 spills for 20,000 partitions (with LZO compression), reading about 10 kB from each spill file, which is re-opened for every partition. Since this is just a merge sort, there is no reason not to keep the input files open and read them sequentially, eliminating the seeks altogether.