HDFS applications query block location information to compute splits. One example of this is FileInputFormat, which has code that walks the block locations and compares a file offset against each location's offset and length to find the block containing it.
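A minimal sketch of this kind of offset lookup, loosely modeled on the loop in FileInputFormat's getBlockIndex. The `Loc` class is a hypothetical stand-in for BlockLocation (only offset and length), used so the example compiles without Hadoop on the classpath:

```java
final class BlockIndexSketch {
    // Hypothetical stand-in for org.apache.hadoop.fs.BlockLocation.
    static final class Loc {
        final long offset;  // byte offset of this block within the file
        final long length;  // number of bytes covered by this block

        Loc(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Returns the index of the location whose byte range contains fileOffset. */
    static int blockIndex(Loc[] locs, long fileOffset) {
        for (int i = 0; i < locs.length; i++) {
            long start = locs[i].offset;
            long end = start + locs[i].length;  // exclusive upper bound
            if (start <= fileOffset && fileOffset < end) {
                return i;
            }
        }
        throw new IllegalArgumentException(
            "Offset " + fileOffset + " is outside the file");
    }
}
```

The lookup assumes every entry in the array describes a range of the logical file. Any entry that does not satisfy this assumption can shift which index is returned for a given offset.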
Erasure coding (EC) confuses this because the returned block locations also include parity block locations, which are not part of the logical file length. This throws off the offset calculation and, with it, the topology/caching information derived from it.
Applications can figure out what's a parity block by reading the EC policy and then parsing the schema, but it'd be a lot better if we exposed this more generically in BlockLocation instead.
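To illustrate what a more generic exposure might look like from the application side, here is a rough sketch under stated assumptions. The `Loc` class and its `isParity` flag are invented for this example and are not an existing BlockLocation API; the flat list of entries is a toy model, not how HDFS actually lays out EC locations:

```java
import java.util.ArrayList;
import java.util.List;

final class ParityFilterSketch {
    // Hypothetical location entry. The isParity flag is an assumption
    // made up for this sketch, not a real BlockLocation field.
    static final class Loc {
        final long length;       // bytes stored in this block
        final boolean isParity;  // true if the block holds parity, not file data

        Loc(long length, boolean isParity) {
            this.length = length;
            this.isParity = isParity;
        }
    }

    /** Keeps only the entries that hold actual file data. */
    static List<Loc> dataOnly(List<Loc> locs) {
        List<Loc> result = new ArrayList<>();
        for (Loc loc : locs) {
            if (!loc.isParity) {
                result.add(loc);
            }
        }
        return result;
    }

    /** Sums the lengths of data entries; parity bytes are not counted. */
    static long logicalLength(List<Loc> locs) {
        long total = 0;
        for (Loc loc : dataOnly(locs)) {
            total += loc.length;
        }
        return total;
    }
}
```

With a flag like this, an application could drop parity entries before computing splits without knowing anything about the EC policy or schema.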