The reason for this is that FileInputFormat.getBlockIndex() returns the index of the block containing the split's starting offset. Instead, it should identify all the blocks that the split spans and then choose the block that contributes the most data to the split.
We could use the following approach:
if (numBlocks == 1)
    return startIndex;
else if (numBlocks == 2)
    return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
else
    return startIndex + 1;
The rationale is that if the split spans more than two blocks, the second block is fully contained within the split and therefore contributes its entire block length, which no partial first or last block can exceed.
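The approach above can be sketched as a standalone method. This is a minimal illustration, not Hadoop's actual implementation: the Block record and the method name chooseBlockIndex are hypothetical stand-ins for Hadoop's BlockLocation and getBlockIndex.

```java
// Sketch of the proposed block-selection logic. Block is a simplified
// stand-in (offset, length) for Hadoop's BlockLocation; chooseBlockIndex
// is a hypothetical name, not part of the FileInputFormat API.
public class BlockIndexSketch {
    record Block(long offset, long length) {}

    // Returns the index of the block contributing the most bytes to the
    // split [splitStart, splitStart + splitLength).
    static int chooseBlockIndex(Block[] blocks, long splitStart, long splitLength) {
        long splitEnd = splitStart + splitLength;

        // Find the block containing the split's starting offset
        // (this is all the current getBlockIndex() does).
        int startIndex = -1;
        for (int i = 0; i < blocks.length; i++) {
            if (splitStart >= blocks[i].offset()
                    && splitStart < blocks[i].offset() + blocks[i].length()) {
                startIndex = i;
                break;
            }
        }

        // Count how many blocks the split spans.
        int numBlocks = 0;
        for (int i = startIndex; i < blocks.length && blocks[i].offset() < splitEnd; i++) {
            numBlocks++;
        }

        if (numBlocks == 1) {
            return startIndex;
        } else if (numBlocks == 2) {
            long bytesInFirstBlock =
                blocks[startIndex].offset() + blocks[startIndex].length() - splitStart;
            long bytesInSecondBlock = splitEnd - blocks[startIndex + 1].offset();
            return (bytesInFirstBlock > bytesInSecondBlock) ? startIndex : startIndex + 1;
        } else {
            // Three or more blocks: the second block lies entirely inside the
            // split, so it contributes a full block length; no partial first
            // or last block can contribute more.
            return startIndex + 1;
        }
    }

    public static void main(String[] args) {
        Block[] blocks = {
            new Block(0, 100), new Block(100, 100), new Block(200, 100)
        };
        // Split [80, 210) spans three blocks: contributions are 20, 100, 10,
        // so the second block (index 1) wins.
        System.out.println(chooseBlockIndex(blocks, 80, 130)); // prints 1
    }
}
```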
Note that we cannot identify the block index based on the amount of data contributed by each individual host, because of the replication factor.
For example, consider the following split (assume dfs block size = 100):
Block 1 contributes 20 bytes and its hosts are A, B, C
Block 2 contributes 100 bytes and its hosts are A, D, E
Block 3 contributes 10 bytes and its hosts are D, E, F
If we aggregate on a per-host basis, host A, having contributed 120 bytes, would be the ideal choice. However, if we then choose Block 1 as the index to be returned, hosts B and C would also be treated as data-local, which is suboptimal.
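The per-host totals in this example can be worked out mechanically. The sketch below (hypothetical helper, not Hadoop code) aggregates the split bytes over each block's replica hosts:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Worked version of the example above (dfs block size = 100): sum the
// split bytes seen by each replica host. Illustrative only, not Hadoop API.
public class HostAggregation {
    static Map<String, Long> aggregate(long[] bytes, String[][] hosts) {
        Map<String, Long> perHost = new LinkedHashMap<>();
        for (int b = 0; b < bytes.length; b++) {
            for (String h : hosts[b]) {
                perHost.merge(h, bytes[b], Long::sum);
            }
        }
        return perHost;
    }

    public static void main(String[] args) {
        long[] bytes = {20, 100, 10};                               // per-block contribution
        String[][] hosts = {{"A","B","C"}, {"A","D","E"}, {"D","E","F"}};
        // Host A totals 20 + 100 = 120 bytes, the per-host maximum, but A
        // also replicates Block 1; returning Block 1's index would mark
        // B and C (20 bytes each) as data-local too.
        System.out.println(aggregate(bytes, hosts));
        // prints {A=120, B=20, C=20, D=110, E=110, F=10}
    }
}
```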