There is a bug in the ORC deserialization code that, when triggered, results in completely wrong data being read. I've marked this as a Blocker as per the docs in https://spark.apache.org/contributing.html as it's a data correctness issue.
The bug is triggered when all of the following conditions are met:
- the non-vectorized ORC reader is being used;
- a schema is explicitly specified when reading the ORC file;
- the provided schema has columns not present in the ORC file, and these columns are in the middle of the schema;
- the ORC file being read contains null values in the columns after the ones added by the schema.
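To make the third condition concrete, here is a minimal sketch of a check for it. The helper name `new_columns_in_middle` and the plain-list representation of column names are hypothetical and not part of Spark. The sketch also assumes that "in the middle" means "placed before at least one column that does exist in the file".

```python
from typing import List, Sequence


def new_columns_in_middle(file_columns: Sequence[str],
                          read_columns: Sequence[str]) -> List[str]:
    """Return read-schema columns that are absent from the file and are
    followed by at least one column that is present in the file.

    Hypothetical helper: it only models the schema condition, not the
    reader setting or the presence of nulls in the data.
    """
    in_file = set(file_columns)
    # Position of the last read-schema column that also exists in the file.
    last_present = max(
        (i for i, name in enumerate(read_columns) if name in in_file),
        default=-1,
    )
    return [
        name
        for i, name in enumerate(read_columns)
        if name not in in_file and i < last_present
    ]
```

An empty result means every new column sits at the end of the schema, so this particular condition is not met.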
When all of these are met:
- the internal state of the ORC deserializer becomes inconsistent and, as a result,
- null values from the ORC file are set on the wrong columns instead of the columns they belong to, and
- stale values from the previous record are not cleared from the columns that should have been null.
Here's a concrete example. Let's consider the following DataFrame:
and the following schema:
Notice the `col4 int` added in the middle that doesn't exist in the DataFrame.
Saving this DataFrame to ORC and then reading it back with the specified schema should result in reading the same values, with nulls for `col4`. Instead, we get the following back:
Notice how the `def` from the second record doesn't get properly cleared and ends up in the third record as well; also, instead of `col2 = 9` in the last record as expected, we get the null that should've been in column 3 instead.
When this issue is triggered, completely wrong results are read from the ORC file. The set of conditions that triggers it is fairly narrow, so the number of affected users is probably limited. However, some users may be affected without having noticed, precisely because the conditions are so obscure.
The issue is caused by calling `setNullAt` with a wrong index in `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for review shortly.
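To illustrate how a wrong index in a null-setting call can produce both symptoms, here is a toy model in plain Python. It is not the Spark code. It assumes that one mutable result row is reused across records and that the buggy path uses the column's index in the file where it should use the index in the requested schema. The function name, the column layout and the sample values are all hypothetical, chosen only to be consistent with the symptoms described in this report.

```python
from typing import List, Optional, Sequence, Tuple


def deserialize_all(file_columns: Sequence[str],
                    read_columns: Sequence[str],
                    records: Sequence[Tuple[Optional[object], ...]],
                    buggy: bool) -> List[tuple]:
    """Toy deserializer: copy each file record into a reused result row."""
    # Assumption: a single mutable row is reused for every record.
    row: List[Optional[object]] = [None] * len(read_columns)
    # For each file column, its position in the requested schema.
    target = [read_columns.index(name) for name in file_columns]
    out = []
    for record in records:
        for file_idx, value in enumerate(record):
            if value is None:
                # Buggy path: the file index is used instead of the
                # position in the requested schema.
                null_idx = file_idx if buggy else target[file_idx]
                row[null_idx] = None
            else:
                row[target[file_idx]] = value
        out.append(tuple(row))
    return out


# Hypothetical data: col4 exists only in the requested schema.
file_cols = ["col1", "col2", "col3"]
read_cols = ["col1", "col4", "col2", "col3"]
data = [(1, 2, "abc"), (4, 5, "def"), (8, 9, None)]

wrong = deserialize_all(file_cols, read_cols, data, buggy=True)
right = deserialize_all(file_cols, read_cols, data, buggy=False)
```

In this model the last buggy row comes out as `(8, None, None, "def")`: the null lands on `col2`, and the stale `"def"` from the previous record survives in `col3`. With the correct index the row is `(8, None, 9, None)`.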
This bug is currently only triggered when new columns are added to the middle of the schema. This means that it can be worked around by only adding new columns at the end.
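A minimal sketch of that workaround, assuming the schema is held as a list of `(name, type)` pairs before being passed to the reader as a DDL string. The helper names `move_new_columns_to_end` and `to_ddl` are hypothetical, and the column types other than `col4 int` are made up for illustration.

```python
from typing import List, Sequence, Tuple

Field = Tuple[str, str]  # (column name, DDL type)


def move_new_columns_to_end(file_columns: Sequence[str],
                            read_schema: Sequence[Field]) -> List[Field]:
    """Reorder the schema so columns missing from the file come last.

    The relative order within each group is preserved.
    """
    in_file = set(file_columns)
    existing = [field for field in read_schema if field[0] in in_file]
    added = [field for field in read_schema if field[0] not in in_file]
    return existing + added


def to_ddl(schema: Sequence[Field]) -> str:
    """Render the schema as a DDL-style string such as 'a int, b string'."""
    return ", ".join(f"{name} {ddl_type}" for name, ddl_type in schema)


# Hypothetical schema with col4 added in the middle.
requested = [("col1", "int"), ("col4", "int"), ("col2", "int"), ("col3", "string")]
safe = move_new_columns_to_end(["col1", "col2", "col3"], requested)
safe_ddl = to_ddl(safe)
```

If the original column order matters downstream, the columns can be reordered after the read with a plain `select`.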