[SPARK-31574] Schema evolution in spark while using the storage format as parquet - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

Hi Team,

Use case:

Suppose table T1 has a column C1 of type int in schema version 1. When T1 was first onboarded, I wrote a couple of Parquet files with schema version 1.

In schema version 2, the data type of C1 changed from int to string, so new data is now written to Parquet with schema version 2.

As a result, some Parquet files were written with schema version 1 and some with schema version 2.

Problem statement :

1. We are not able to execute the command below from Spark SQL:
```ALTER TABLE T1 CHANGE C1 C1 string```

2. As a workaround, I went to Hive and altered the table there to change the data type, because Hive supports this. I then tried to read the data in Spark, which fails with the following error:

```
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
    at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
    at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```

3. I suspect this happens because the underlying Parquet files were written with an integer type, while we are reading them through a table whose column has been changed to string.
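The suspicion in point 3 can be illustrated with a toy model. This is only a sketch in plain Python, not the Parquet or Spark code: the class and function names (`Dictionary`, `PlainIntegerDictionary`, `read_column`) are hypothetical stand-ins loosely named after the frames in the stack trace. It assumes that a dictionary built for int values only implements the int decoder, so asking it for a binary/string value fails with the dictionary's class name as the message.

```python
class Dictionary:
    """Toy stand-in for a dictionary page. Every decoder fails unless a subclass overrides it."""

    def decode_to_int(self, idx):
        raise NotImplementedError(type(self).__name__)

    def decode_to_binary(self, idx):
        raise NotImplementedError(type(self).__name__)


class PlainIntegerDictionary(Dictionary):
    """Dictionary written by schema version 1 (C1 is int): only the int decoder exists."""

    def __init__(self, values):
        self._values = list(values)

    def decode_to_int(self, idx):
        return self._values[idx]


def read_column(dictionary, ids, table_type):
    """Decode dictionary ids using the decoder implied by the table's declared column type."""
    if table_type == "int":
        return [dictionary.decode_to_int(i) for i in ids]
    if table_type == "string":
        # The table now says string, so the reader asks for binary values.
        return [dictionary.decode_to_binary(i) for i in ids]
    raise ValueError(f"unsupported table type: {table_type}")
```

Under these assumptions, `read_column(PlainIntegerDictionary([10, 20]), [0, 1, 0], "int")` succeeds, while the same call with `"string"` raises an error whose message is the dictionary class name, which mirrors the shape of the exception above.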

How to reproduce:
Spark SQL
1. Create a table in Spark SQL with one column of type int, stored as Parquet.
2. Insert some data into the table.
3. Select from the table; the data is returned as expected.

Hive
1. Change the column's data type from int to string with an ALTER command.
2. Read the data again. Hive can still read it after the data type change.

Spark SQL
1. Read the data again. The error shown above now appears.
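The reproduction steps can be written out as concrete statements. This is a sketch only: the issue does not give the exact SQL, so the table name `t1`, column name `c1`, the inserted values, and the helper names `statements_for` and `format_script` are all assumptions made for illustration. Each statement is tagged with the engine it is meant to run in.

```python
# Assumed statements for the steps described above; t1, c1 and the values are illustrative.
REPRO_STEPS = [
    ("spark-sql", "CREATE TABLE t1 (c1 INT) STORED AS PARQUET"),
    ("spark-sql", "INSERT INTO t1 VALUES (1), (2), (3)"),
    ("spark-sql", "SELECT c1 FROM t1"),
    ("hive", "ALTER TABLE t1 CHANGE c1 c1 STRING"),
    ("hive", "SELECT c1 FROM t1"),
    ("spark-sql", "SELECT c1 FROM t1"),  # the read that is reported to fail
]


def statements_for(engine):
    """Return, in order, the statements meant for one engine."""
    return [sql for eng, sql in REPRO_STEPS if eng == engine]


def format_script(steps=REPRO_STEPS):
    """Render the steps as a commented SQL script, one statement per step."""
    return "\n".join(f"-- [{engine}]\n{sql};" for engine, sql in steps)
```

Calling `print(format_script())` gives a script that can be pasted step by step into the respective shells.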

Now the question is: how can schema evolution be handled in Spark when the storage format is Parquet?
