[SPARK-26859] Fix field writer index bug in non-vectorized ORC deserializer - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.4, 2.4.1, 3.0.0
Component/s: SQL
Labels:
- correctness

Description

There is a bug in the ORC deserialization code that, when triggered, results in completely wrong data being read. I've marked this as a Blocker as per the docs in https://spark.apache.org/contributing.html as it's a data correctness issue.

The bug is triggered when all of the following conditions are met:

the non-vectorized ORC reader is being used;
a schema is explicitly specified when reading the ORC file;
the provided schema has columns not present in the ORC file, and these columns are in the middle of the schema; and
the ORC file being read contains null values in the columns after the ones added by the schema.

When all of these are met:

the internal state of the ORC deserializer becomes inconsistent and, as a result,
the null values from the ORC file are set on the wrong columns rather than the ones they belong to, and
stale values from the previous record are not cleared in the columns that should be null.

Here's a concrete example. Let's consider the following DataFrame:

        val rdd = sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
        val df = rdd.toDF("col1", "col2", "col3")

and the following schema:

col1 int, col4 int, col2 int, col3 string

Notice the `col4 int` added in the middle that doesn't exist in the dataframe.

Saving this dataframe to ORC and then reading it back with the specified schema should result in reading the same values, with nulls for `col4`. Instead, we get the following back:

[1,null,2,abc]
[4,null,5,def]
[8,null,null,def]

Notice how the `def` from the second record is not cleared and ends up in the third record as well. Also, `col2` in the last record is null rather than the expected 9: the null that belongs in `col3` was written to `col2`. The expected last record is `[8,null,9,null]`.

Impact
When this issue is triggered, completely wrong data is read from the ORC file. The set of conditions under which it is triggered is fairly narrow, so the set of affected users is probably limited. Some users may also be affected without having noticed, because the conditions are so obscure.

Bug details
The issue is caused by calling `setNullAt` with a wrong index in `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for review shortly.
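
To illustrate how a wrong null index can produce exactly the output shown above, here is a simplified, self-contained model in plain Scala. It is not the Spark source: `OrcIndexBugSketch`, `presentPositions`, `deserialize` and `readAll` are hypothetical names, and the model assumes the deserializer reuses one mutable result row and iterates only over the columns that exist in the file.

```scala
// Simplified model of the failure mode; NOT the actual OrcDeserializer code.
object OrcIndexBugSketch {
  // Schema "col1, col4, col2, col3" read from a file holding "col1, col2, col3":
  // the file's columns land at schema positions 0, 2 and 3 (col4 at position 1 is missing).
  val presentPositions: Array[Int] = Array(0, 2, 3)
  val schemaWidth = 4

  // Copies one file record into the reused result row.
  // fixed = false nulls the slot at the file-side index (the assumed bug);
  // fixed = true nulls the slot at the schema position.
  def deserialize(row: Array[Any], fileRecord: Array[Any], fixed: Boolean): Unit = {
    var i = 0
    while (i < presentPositions.length) {
      val target = presentPositions(i)
      val value = fileRecord(i)
      if (value != null) row(target) = value
      else if (fixed) row(target) = null
      else row(i) = null // wrong: i indexes the file record, not the result row
      i += 1
    }
  }

  // Reads all records through a single reused row, snapshotting each result.
  def readAll(records: Seq[Array[Any]], fixed: Boolean): Seq[List[Any]] = {
    val row = new Array[Any](schemaWidth) // starts as all nulls, reused across records
    records.map { r =>
      deserialize(row, r, fixed)
      row.toList
    }
  }

  // The three records from the example DataFrame, in file column order.
  val records: Seq[Array[Any]] = Seq(
    Array[Any](1, 2, "abc"),
    Array[Any](4, 5, "def"),
    Array[Any](8, 9, null)
  )
}
```

In this model, `OrcIndexBugSketch.readAll(OrcIndexBugSketch.records, fixed = false)` ends with `List(8, null, null, "def")`, matching the wrong third record above, while `fixed = true` ends with `List(8, null, 9, null)`.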

Workaround
This bug is currently only triggered when new columns are added to the middle of the schema. This means that it can be worked around by only adding new columns at the end.
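
As a rough aid for applying this workaround, the sketch below checks whether a read schema places a column that is missing from the file ahead of a column that is present, and reorders it so that missing columns come last. It works on plain column names only; `SchemaOrderSketch` and both of its methods are hypothetical helpers, not Spark APIs.

```scala
// Hypothetical helpers operating on column names only; not part of Spark.
object SchemaOrderSketch {
  // True if a column absent from the file appears before a column that is present.
  def hasMissingColumnInMiddle(fileColumns: Set[String], readSchema: Seq[String]): Boolean = {
    val firstMissing = readSchema.indexWhere(c => !fileColumns.contains(c))
    firstMissing != -1 && readSchema.drop(firstMissing).exists(fileColumns.contains)
  }

  // Stable reorder: columns present in the file first, missing columns at the end.
  def moveMissingToEnd(fileColumns: Set[String], readSchema: Seq[String]): Seq[String] = {
    val (present, missing) = readSchema.partition(fileColumns.contains)
    present ++ missing
  }
}
```

For the example above, the file columns are `col1, col2, col3` and the read schema `col1, col4, col2, col3` is flagged; the reordered schema is `col1, col2, col3, col4`. The reordering changes the column order of the result, so a later projection would be needed if the original order matters.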

Issue Links

links to

GitHub Pull Request #23766

GitHub Pull Request #25384

People

Assignee: Ivan Vergiliev

Reporter: Ivan Vergiliev

Votes: 0

Watchers: 4

Dates

Created: 12/Feb/19 10:47

Updated: 08/Aug/19 21:56

Resolved: 20/Feb/19 14:05