Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-31574

Schema evolution in spark while using the storage format as parquet

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.3.0
    • None
    • SQL

    Description

      Hi Team,

       

      Use case:

      Suppose there is a table T1 with column C1 with datatype as int in schema version 1. In the first on boarding table T1. I wrote couple of parquet files with this schema version 1 with underlying file format used parquet.

      Now in schema version 2 the C1 column datatype changed to string from int. Now It will write data with schema version 2 in parquet.

      So some parquet files are written with schema version 1 and some written with schema version 2.

      Problem statement :

      1. We are not able to execute the below command from spark sql
      ```Alter table Table T1 change C1 C1 string```

      2. So as a solution i goto hive and alter the table change datatype because it supported in hive then try to read the data in spark. So it is giving me error

      ```

      Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

        at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)

        at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)

        at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)

        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)

        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)

        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)

        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)

        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)

        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)

        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

        at org.apache.spark.scheduler.Task.run(Task.scala:109)

        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)```

       

      3. Suspecting that the underlying parquet file is written with integer type and we are reading from a table whose column is changed to string type. So that is why it is happening.

      How you can reproduce this:
      spark sql
      1. Create a table from spark sql with one column with datatype as int with stored as parquet.
      2. Now put some data into table.
      3. Now you can see the data if you select from table.

      Hive
      1. change datatype from int to string by alter command
      2. Now try to read data, You will be able to read the data here even after changing the datatype.

      spark sql 
      1. Try to read data from here now you will see the error.

      Now the question is how to solve schema evolution in spark while using the storage format as parquet.

      Attachments

        Activity

          People

            Unassigned Unassigned
            sharadgupta619 sharad Gupta
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: