[SPARK-16518] Schema Compatibility of Parquet Data Source

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Currently, Spark does not check schema compatibility when data is appended to an existing table, and different file formats behave differently. This JIRA summarizes what I observed for Parquet data source tables.

      Scenario 1: Data type mismatch
      The existing schema is (col1 int, col2 string).
      The schema of the appended dataset is (col1 int, col2 int).

      Case 1: When spark.sql.parquet.mergeSchema is false, we get the following error:

      Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
       Lost task 0.0 in stage 4.0 (TID 4, localhost): java.lang.NullPointerException
      	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:231)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:62)
      

      Case 2: When spark.sql.parquet.mergeSchema is true, we get the following error:

      Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.SparkException:
       Failed merging schema of file file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-4c2f0b69-ee05-4be1-91f0-0e54f89f2308/part-r-00000-6b76638c-a624-444c-9479-3c8e894cb65e.snappy.parquet:
      root
       |-- a: integer (nullable = false)
       |-- b: string (nullable = true)
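
The mismatch in Scenario 1 could in principle be detected before any data is written, by comparing column types by name. The following is a minimal sketch in plain Python, not Spark code: schemas are modeled as lists of (name, type) pairs, and `find_type_conflicts` is a hypothetical helper, not part of any Spark API.

```python
from typing import List, Tuple

# A schema is modeled as an ordered list of (column name, type name) pairs.
# This representation is an assumption made for illustration only.
Schema = List[Tuple[str, str]]


def find_type_conflicts(existing: Schema, appending: Schema) -> List[Tuple[str, str, str]]:
    """Return (column, existing type, appending type) for each shared column whose types differ."""
    existing_types = dict(existing)
    conflicts = []
    for name, new_type in appending:
        old_type = existing_types.get(name)
        # Only columns present on both sides can conflict.
        if old_type is not None and old_type != new_type:
            conflicts.append((name, old_type, new_type))
    return conflicts


existing = [("col1", "int"), ("col2", "string")]
appending = [("col1", "int"), ("col2", "int")]
print(find_type_conflicts(existing, appending))  # [('col2', 'string', 'int')]
```

A check of this kind would let the append fail early with a message naming the offending column, rather than surfacing as a NullPointerException or a merge failure at read time.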
      

      Scenario 2: More columns in the appended dataset
      The existing schema is (col1 int, col2 string).
      The schema of the appended dataset is (col1 int, col2 string, col3 int).

      Case 1: When spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
      Case 2: When spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int, col2 string, col3 int).
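
The Case 2 result in Scenario 2 is what a union-style merge of the two schemas would produce. As a rough model only (this is not Spark's implementation, and `merge_schemas` is a hypothetical name), such a merge keeps the existing columns in order, appends columns it has not seen, and fails when a shared column has conflicting types:

```python
from typing import List, Tuple

# Schemas are modeled as ordered (column name, type name) pairs; an assumption for illustration.
Schema = List[Tuple[str, str]]


def merge_schemas(left: Schema, right: Schema) -> Schema:
    """Union two schemas by column name, raising ValueError on a type conflict."""
    merged = list(left)
    known = dict(left)
    for name, dtype in right:
        if name not in known:
            # New column: append it after the existing ones.
            merged.append((name, dtype))
            known[name] = dtype
        elif known[name] != dtype:
            raise ValueError(f"Failed merging column {name!r}: {known[name]} vs {dtype}")
    return merged


existing = [("col1", "int"), ("col2", "string")]
appending = [("col1", "int"), ("col2", "string"), ("col3", "int")]
print(merge_schemas(existing, appending))
# [('col1', 'int'), ('col2', 'string'), ('col3', 'int')]
```

Under the same model, the Scenario 1 schemas raise an error, which is consistent with the merge failure reported there.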

      Scenario 3: Fewer columns in the appended dataset
      The existing schema is (col1 int, col2 string).
      The schema of the appended dataset is (col1 int).

      Case 1: When spark.sql.parquet.mergeSchema is false, the schema of the result set is (col1 int, col2 string).
      Case 2: When spark.sql.parquet.mergeSchema is true, the schema of the result set is (col1 int).
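
Scenarios 2 and 3 differ only in which side carries the extra column, so a compatibility check could report both directions at once. A minimal sketch in plain Python, assuming schemas are lists of (name, type) pairs; `diff_columns` is a hypothetical helper, not a Spark API:

```python
from typing import List, Tuple

# Schemas are modeled as ordered (column name, type name) pairs; an assumption for illustration.
Schema = List[Tuple[str, str]]


def diff_columns(existing: Schema, appending: Schema) -> Tuple[List[str], List[str]]:
    """Return (columns missing from the appended data, columns only in the appended data)."""
    existing_names = [name for name, _ in existing]
    appending_names = [name for name, _ in appending]
    missing = [name for name in existing_names if name not in appending_names]
    extra = [name for name in appending_names if name not in existing_names]
    return missing, extra


existing = [("col1", "int"), ("col2", "string")]

# Scenario 3: the appended dataset lacks col2.
print(diff_columns(existing, [("col1", "int")]))  # (['col2'], [])

# Scenario 2: the appended dataset adds col3.
print(diff_columns(existing, [("col1", "int"), ("col2", "string"), ("col3", "int")]))  # ([], ['col3'])
```

With such a report, the caller could decide explicitly whether missing or extra columns are acceptable instead of depending on the mergeSchema setting at read time.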

People

    • Assignee: Unassigned
    • Reporter: Xiao Li (smilegator)