Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.1.0
None
None
Description
The current default split size is the size of a block (32 MB), but a SequenceFile sets it to SequenceFile.SYNC_INTERVAL (2 KB). We currently have a Map/Reduce application working on crawled documents. Its input data consists of 356 sequence files, each around 30 GB in size. The JobTracker takes forever to launch the job because it needs to generate 356 * 30 GB / 2 KB map tasks, which is more than 5 billion.
The proposed solution is to make the minimum split size configurable so that the programmer can control the number of map tasks generated.
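
To make the effect concrete, here is a minimal sketch of the arithmetic, not the actual patch. It assumes the effective split size is the larger of the size the file format asks for and a configured minimum. The class and method names (SplitSizeSketch, effectiveSplitSize, totalMapTasks) are hypothetical, binary units are assumed for 2K, 32M, and 30G, and the 32 MB minimum is only an illustrative choice.

```java
public class SplitSizeSketch {

    // Assumption: the configured minimum acts as a floor on the split
    // size requested by the input format.
    static long effectiveSplitSize(long formatSplitSize, long minSplitSize) {
        return Math.max(formatSplitSize, minSplitSize);
    }

    // Number of splits needed to cover one file, rounding up.
    static long splitsPerFile(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    // Total map tasks for a set of equally sized files (one task per split).
    static long totalMapTasks(long fileCount, long fileSize,
                              long formatSplitSize, long minSplitSize) {
        long splitSize = effectiveSplitSize(formatSplitSize, minSplitSize);
        return fileCount * splitsPerFile(fileSize, splitSize);
    }

    public static void main(String[] args) {
        final long KB = 1024L;
        final long MB = 1024L * KB;
        final long GB = 1024L * MB;

        long fileCount = 356;           // from the description
        long fileSize = 30 * GB;        // "around 30G" per file
        long syncInterval = 2 * KB;     // SequenceFile.SYNC_INTERVAL, taken as 2K

        // No minimum configured: the 2K split size applies.
        long before = totalMapTasks(fileCount, fileSize, syncInterval, 0);

        // Illustrative minimum of 32 MB (one block).
        long after = totalMapTasks(fileCount, fileSize, syncInterval, 32 * MB);

        System.out.println("map tasks with 2K splits:      " + before);
        System.out.println("map tasks with 32M min splits: " + after);
    }
}
```

Under these assumptions the job goes from 5,599,395,840 map tasks to 341,760, i.e. 960 splits per file instead of about 15.7 million.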