Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
3.0.0-alpha1
None
Description
HDFS applications query block location information to compute splits. One example is FileInputFormat, which contains code like the following to calculate offsets:
long bytesInThisBlock = blkLocations[startIndex].getOffset() + blkLocations[startIndex].getLength() - offset;
Erasure coding (EC) breaks this calculation because the block locations include parity block locations as well, which are not part of the logical file length. This throws off the offset calculation, and with it the topology and caching information derived from it.
Applications can figure out what's a parity block by reading the EC policy and then parsing its schema, but it would be much better to expose this generically in BlockLocation instead.
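To make the problem concrete, here is a minimal, self-contained sketch of the offset arithmetic and of what a generic parity marker could buy the caller. Everything in it is an assumption made for illustration: `Loc` is a toy stand-in rather than Hadoop's BlockLocation, the `parity` field and the `dataOnly` helper are hypothetical, and the model treats parity as separate flagged entries, which does not reproduce how HDFS actually lays out striped block groups or what API the fix introduced.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a block location; NOT Hadoop's BlockLocation class.
final class Loc {
    final long offset;     // start of this entry within the file
    final long length;     // number of bytes this entry covers
    final boolean parity;  // hypothetical flag: true for an EC parity entry

    Loc(long offset, long length, boolean parity) {
        this.offset = offset;
        this.length = length;
        this.parity = parity;
    }
}

final class SplitSketch {
    // Index of the entry that contains the given file offset, or -1 if none does.
    static int blockIndex(List<Loc> locs, long offset) {
        for (int i = 0; i < locs.size(); i++) {
            Loc l = locs.get(i);
            if (l.offset <= offset && offset < l.offset + l.length) {
                return i;
            }
        }
        return -1;
    }

    // Same arithmetic as the line quoted in the description.
    static long bytesInThisBlock(List<Loc> locs, long offset) {
        int i = blockIndex(locs, offset);
        if (i < 0) {
            throw new IllegalArgumentException("offset outside all locations: " + offset);
        }
        Loc l = locs.get(i);
        return l.offset + l.length - offset;
    }

    // Total bytes covered by the given entries.
    static long coveredBytes(List<Loc> locs) {
        long total = 0;
        for (Loc l : locs) {
            total += l.length;
        }
        return total;
    }

    // Hypothetical helper: keep only entries that hold logical file data.
    static List<Loc> dataOnly(List<Loc> locs) {
        List<Loc> out = new ArrayList<>();
        for (Loc l : locs) {
            if (!l.parity) {
                out.add(l);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy file: 256 logical bytes in two data entries, plus one parity entry.
        List<Loc> locs = new ArrayList<>();
        locs.add(new Loc(0, 128, false));
        locs.add(new Loc(128, 128, false));
        locs.add(new Loc(256, 128, true));

        System.out.println("bytes left in block at offset 100: " + bytesInThisBlock(locs, 100));
        System.out.println("covered, parity included: " + coveredBytes(locs));
        System.out.println("covered, data only: " + coveredBytes(dataOnly(locs)));
    }
}
```

In this toy model the parity entry makes the locations appear to cover more bytes than the logical file holds, so any split logic that walks the array blindly would run past the end of the file. With a marker on each entry, the caller can filter once and reuse the existing arithmetic unchanged, without having to parse the EC policy schema.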