SPARK-26859: Fix field writer index bug in non-vectorized ORC deserializer


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.4, 2.4.1, 3.0.0
    • Component/s: SQL

    Description

      There is a bug in the ORC deserialization code that, when triggered, results in completely wrong data being read. I've marked this as a Blocker as per the docs in https://spark.apache.org/contributing.html as it's a data correctness issue.

      The bug is triggered when the following set of conditions are all met:

      • the non-vectorized ORC reader is being used;
      • a schema is explicitly specified when reading the ORC file;
      • the provided schema has columns not present in the ORC file, and these columns are in the middle of the schema;
      • the ORC file being read contains null values in the columns after the ones added by the schema.

      When all of these are met:

      • the internal state of the ORC deserializer gets corrupted, and, as a result,
      • the null values from the ORC file end up being set on the wrong columns, not the ones they belong to, and
      • the old values from the null columns don't get cleared from the previous record.

      Here's a concrete example. Let's consider the following DataFrame:

        import spark.implicits._  // needed for rdd.toDF
        val rdd = spark.sparkContext.parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null)))
        val df = rdd.toDF("col1", "col2", "col3")
      

      and the following schema:

      col1 int, col4 int, col2 int, col3 string
      

      Notice the `col4 int` added in the middle, which doesn't exist in the DataFrame.

      Saving this DataFrame to ORC and then reading it back with the specified schema should result in reading the same values, with nulls for `col4`. Instead, we get the following back:

      [1,null,2,abc]
      [4,null,5,def]
      [8,null,null,def]
      

      Notice how the `def` from the second record doesn't get properly cleared and ends up in the third record as well; also, instead of the expected `col2 = 9` in the last record, we get the null that should've been in `col3`.
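
      For reference, here is a self-contained sketch of the full reproduction. The output path, the `local[*]` master, and the use of `spark.sql.orc.impl=native` plus `spark.sql.orc.enableVectorizedReader=false` to force the non-vectorized ORC reader are my additions, not part of the original report:

        // Hypothetical end-to-end reproduction; the configs force the non-vectorized
        // native ORC reader, which is the code path containing the bug.
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder()
          .master("local[*]")
          .config("spark.sql.orc.impl", "native")
          .config("spark.sql.orc.enableVectorizedReader", "false")
          .getOrCreate()
        import spark.implicits._

        val df = spark.sparkContext
          .parallelize(Seq((1, 2, "abc"), (4, 5, "def"), (8, 9, null: String)))
          .toDF("col1", "col2", "col3")

        val path = "/tmp/spark-26859-repro"  // illustrative path
        df.write.mode("overwrite").orc(path)

        // Read back with an explicit schema that inserts `col4` in the middle.
        val readBack = spark.read
          .schema("col1 int, col4 int, col2 int, col3 string")
          .orc(path)

        readBack.collect().foreach(println)
        // Expected:  [1,null,2,abc] [4,null,5,def] [8,null,9,null]
        // Observed:  [1,null,2,abc] [4,null,5,def] [8,null,null,def]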

      Impact
      When this issue is triggered, completely wrong data is read from the ORC file. The set of conditions under which it is triggered is fairly narrow, so the number of affected users is probably limited. Some users may also be affected without having realized it, because the triggering conditions are so obscure.

      Bug details
      The issue is caused by calling `setNullAt` with the wrong index in `OrcDeserializer.scala:deserialize()`. I have a fix that I'll send out for review shortly.
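
      To illustrate the class of mistake (a simplified, hypothetical model, not the actual Spark code): the deserializer walks the columns that are present in the ORC file with a loop counter, while values must be written at each column's ordinal in the requested schema; using the loop counter in the null path produces exactly the corruption shown above.

        // Simplified sketch of the indexing mistake; all names are illustrative.
        object IndexBugSketch extends App {
          val requestedSchema = Seq("col1", "col4", "col2", "col3") // col4 not in the file
          val fileColumns     = Seq("col1", "col2", "col3")
          // Ordinal of each file column within the requested schema.
          val targetOrdinals  = fileColumns.map(c => requestedSchema.indexOf(c))

          // Like the real deserializer, the result row is reused across records.
          val resultRow = Array.fill[Any](requestedSchema.length)(null)

          def deserialize(record: Seq[Any]): Seq[Any] = {
            for ((value, i) <- record.zipWithIndex) {
              if (value == null) {
                resultRow(i) = null            // BUG: should be resultRow(targetOrdinals(i))
              } else {
                resultRow(targetOrdinals(i)) = value
              }
            }
            resultRow.toSeq
          }

          Seq(Seq(1, 2, "abc"), Seq(4, 5, "def"), Seq(8, 9, null))
            .foreach(r => println(deserialize(r).mkString("[", ",", "]")))
          // Prints [8,null,null,def] for the last record instead of [8,null,9,null]:
          // the null lands on col2, and col3 keeps "def" from the previous record.
        }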

      Workaround
      This bug is currently only triggered when new columns are added to the middle of the schema. This means that it can be worked around by only adding new columns at the end.
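
      For example (reusing the session and path from the reproduction sketch above, both of which are illustrative), reading with the new column appended at the end returns correct data:

        // Workaround: put the added column at the end of the explicit schema.
        val readOk = spark.read
          .schema("col1 int, col2 int, col3 string, col4 int")
          .orc(path)
        readOk.collect().foreach(println)
        // [1,2,abc,null] [4,5,def,null] [8,9,null,null] -- values line up correctly.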

          People

            Assignee: Ivan Vergiliev (ivan.vergiliev)
            Reporter: Ivan Vergiliev (ivan.vergiliev)
            Votes: 0
            Watchers: 4
