[SPARK-35461] Error when reading dictionary-encoded Parquet int column when read schema is bigint - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.2.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

When reading a dictionary-encoded integer column from a Parquet file with a user-specified read schema that declares the column as bigint, Spark currently fails with the following exception:

java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
	at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)

To reproduce:

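    // withParquetFile is a test helper from Spark's ParquetTest trait.
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Each int is repeated 10 times so the column can be dictionary-encoded.
    val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
    withParquetFile(data) { path =>
      // The file stores _1 as an int; request it as bigint (LongType).
      val readSchema = StructType(Seq(StructField("_1", LongType)))
      spark.read.schema(readSchema).parquet(path).first()
    }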
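
The stack trace suggests that the integer dictionary does not implement decodeToLong, so the call falls through to a base-class method that throws. The sketch below models that behaviour with simplified stand-in classes. They are illustrative only and are not the real org.apache.parquet API. The decodeIntAsLong helper is hypothetical: it shows one possible direction (decode as int, then widen on the caller's side) and is not a fix adopted by Spark.

```scala
object DictionaryWideningSketch {
  // Stand-in for the base dictionary: every typed accessor is unsupported
  // unless a subclass overrides it.
  abstract class Dictionary {
    def decodeToInt(id: Int): Int =
      throw new UnsupportedOperationException(getClass.getName)
    def decodeToLong(id: Int): Long =
      throw new UnsupportedOperationException(getClass.getName)
  }

  // Stand-in for an int dictionary: it overrides decodeToInt only.
  final class PlainIntegerDictionary(values: Array[Int]) extends Dictionary {
    override def decodeToInt(id: Int): Int = values(id)
  }

  // Hypothetical helper: read the value as an Int and widen it to Long,
  // instead of asking the dictionary for a Long it cannot produce.
  def decodeIntAsLong(dict: Dictionary, id: Int): Long =
    dict.decodeToInt(id).toLong

  def main(args: Array[String]): Unit = {
    val dict = new PlainIntegerDictionary(Array(10, 20, 30))

    // Requesting a Long directly fails, mirroring the reported exception.
    val direct = scala.util.Try(dict.decodeToLong(1))
    println(s"direct decodeToLong failed: ${direct.isFailure}")

    // Decoding as Int and widening succeeds.
    println(s"widened value: ${decodeIntAsLong(dict, 1)}")
  }
}
```

In this model the failure depends only on which accessor the reader calls, not on the stored values, which is consistent with the exception being raised from getLong on the column vector.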

Issue Links

Blocked: SPARK-36990 Long columns cannot read columns with INT32 type in the parquet file (Resolved)

relates to: SPARK-23007 Add schema evolution test suite for file-based data sources (Resolved)

People

Assignee: Unassigned

Reporter: Chao Sun

Votes: 0

Watchers: 4

Dates

Created: 20/May/21 18:06

Updated: 13/Oct/21 23:03