SPARK-16518

Schema Compatibility of Parquet Data Source

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      Currently, Spark does not check schema compatibility when a dataset is appended to an existing table, and different file formats behave differently. This JIRA summarizes the behavior I observed for Parquet data source tables.

      Scenario 1: Data type mismatch
      The existing schema is (col1 int, col2 string).
      The schema of the appended dataset is (col1 int, col2 int).

      Case 1: When spark.sql.parquet.mergeSchema is false, we get the following error:

      Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
       Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
      	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
      

      Case 2: When spark.sql.parquet.mergeSchema is true, we get the following error:

      Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException:
       Failed merging schema of file file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
      root
       |-- a: integer (nullable = false)
       |-- b: string (nullable = true)
      

      Scenario 2: More columns in the appended dataset
      The existing schema is (col1 int, col2 string).
      The schema of the appended dataset is (col1 int, col2 string, col3 int).

      Case 1: When spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
      Case 2: When spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int, col2 string, col3 int).

      Scenario 3: Fewer columns in the appended dataset
      The existing schema is (col1 int, col2 string).
      The schema of the appended dataset is (col1 int).

      Case 1: When spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
      Case 2: When spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int).
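
To make the three scenarios concrete, here is a minimal sketch of the kind of compatibility check that is currently missing. It is plain Python and does not use any Spark API: `compare_schemas` and `is_compatible` are hypothetical helpers, schemas are simplified to ordered (name, type) pairs with type names as strings, and columns are matched by exact name. Spark's actual rules (nullability, case sensitivity, type widening) are not modeled here.

```python
from typing import Dict, List, Tuple

# Simplified schema: ordered (column name, type name) pairs.
# This is an illustrative stand-in, not Spark's StructType.
Schema = List[Tuple[str, str]]


def compare_schemas(existing: Schema, appended: Schema) -> Dict[str, list]:
    """Report how the appended dataset's schema differs from the existing one."""
    existing_types = dict(existing)
    appended_types = dict(appended)
    return {
        # Same column name, different type (Scenario 1).
        "type_mismatches": [
            (name, existing_types[name], dtype)
            for name, dtype in appended
            if name in existing_types and existing_types[name] != dtype
        ],
        # Columns present only in the appended dataset (Scenario 2).
        "extra_columns": [
            name for name, _ in appended if name not in existing_types
        ],
        # Columns present only in the existing schema (Scenario 3).
        "missing_columns": [
            name for name, _ in existing if name not in appended_types
        ],
    }


def is_compatible(existing: Schema, appended: Schema) -> bool:
    """True only when no difference of any kind is reported."""
    return not any(compare_schemas(existing, appended).values())


if __name__ == "__main__":
    existing = [("col1", "int"), ("col2", "string")]
    scenarios = {
        "Scenario 1": [("col1", "int"), ("col2", "int")],
        "Scenario 2": [("col1", "int"), ("col2", "string"), ("col3", "int")],
        "Scenario 3": [("col1", "int")],
    }
    for label, appended in scenarios.items():
        print(label, compare_schemas(existing, appended))
```

A check along these lines, run before the append, would let each scenario be rejected or handled explicitly instead of surfacing later as a NullPointerException or a schema merge failure. Whether extra or missing columns should be treated as errors is a policy decision that this sketch leaves open.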

              People

              • Assignee: Unassigned
              • Reporter: Xiao Li (smilegator)
              • Votes: 2
              • Watchers: 8
