Details
Improvement
Status: Closed
Major
Resolution: Fixed
2.0.5-alpha
Reviewed
Description
CombineFileInputFormat currently walks through all available nodes and generates multiple (maxSplitsPerNode) splits on a single node before attempting to generate splits on subsequent nodes. This reduces the chance of generating splits for the subsequent nodes, since the blocks already consumed are no longer available to them. Allowing splits to go 1 block above the max-split-size makes this worse.
Allocating a single split per node in each iteration should help distribute splits more evenly across nodes, so that subsequent nodes have more blocks to choose from.
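The difference between the two strategies can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the actual CombineFileInputFormat code: the class `SplitAllocationSketch` and its methods are hypothetical, splits are sized by a fixed block count rather than by bytes, and rack-level grouping and the maxSplitsPerNode cap are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the Hadoop implementation.
public class SplitAllocationSketch {

    /** Returns node -> number of splits created on that node. */
    public static Map<String, Integer> allocate(Map<String, List<String>> blocksByNode,
                                                int blocksPerSplit,
                                                boolean oneSplitPerNodePerPass) {
        Set<String> assigned = new HashSet<>();
        Map<String, Integer> splitsPerNode = new LinkedHashMap<>();
        for (String node : blocksByNode.keySet()) {
            splitsPerNode.put(node, 0);
        }
        boolean progress = true;
        while (progress) {                       // one loop iteration = one pass over all nodes
            progress = false;
            for (Map.Entry<String, List<String>> e : blocksByNode.entrySet()) {
                while (tryBuildSplit(e.getValue(), assigned, blocksPerSplit)) {
                    splitsPerNode.merge(e.getKey(), 1, Integer::sum);
                    progress = true;
                    if (oneSplitPerNodePerPass) {
                        break;                   // move on to the next node after one split
                    }
                }
            }
        }
        return splitsPerNode;
    }

    /** Claims blocksPerSplit unassigned local blocks, or claims nothing and returns false. */
    static boolean tryBuildSplit(List<String> localBlocks, Set<String> assigned, int blocksPerSplit) {
        List<String> picked = new ArrayList<>();
        for (String block : localBlocks) {
            if (!assigned.contains(block)) {
                picked.add(block);
                if (picked.size() == blocksPerSplit) {
                    break;
                }
            }
        }
        if (picked.size() < blocksPerSplit) {
            return false;
        }
        assigned.addAll(picked);
        return true;
    }

    /** Small cluster: three nodes, six blocks, every block replicated on every node. */
    public static Map<String, List<String>> demoCluster() {
        List<String> blocks = Arrays.asList("b1", "b2", "b3", "b4", "b5", "b6");
        Map<String, List<String>> cluster = new LinkedHashMap<>();
        cluster.put("n1", blocks);
        cluster.put("n2", blocks);
        cluster.put("n3", blocks);
        return cluster;
    }

    public static void main(String[] args) {
        // All splits land on the first node visited.
        System.out.println(allocate(demoCluster(), 2, false)); // {n1=3, n2=0, n3=0}
        // One split per node per pass spreads them out.
        System.out.println(allocate(demoCluster(), 2, true));  // {n1=1, n2=1, n3=1}
    }
}
```

In this toy cluster every block is local to every node, which is roughly the situation on a small cluster where the replication factor is close to the node count. That is where exhausting one node before visiting the next skews the splits most.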