Details
- Type: Bug
- Priority: Major
- Status: Resolved
- Resolution: Fixed
- Versions: 3.4.0, 3.5.0, 4.0.0
Description
The scenario is a bit more involved than the title suggests, but it's not far-fetched:
- Write a Parquet file with one column.
- Evolve the schema: add a new column with a DecimalType wide enough that it doesn't fit in a long, and give it a default value.
- Try to read the old file with the new schema.
- The read fails with a NullPointerException.
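The steps above can be sketched as a Spark SQL session on an affected version (a hedged illustration only — the table name, column names, and default value are made up for this example):

```sql
-- 1. Write a Parquet file with one column.
CREATE TABLE t (id BIGINT) USING parquet;
INSERT INTO t VALUES (1), (2);

-- 2. Evolve the schema: add a decimal too wide for a long (precision > 18)
--    with a default value.
ALTER TABLE t ADD COLUMN d DECIMAL(30, 2) DEFAULT 12.34;

-- 3. Re-read the old file with the new schema through the vectorized
--    Parquet reader: NPE before the fix.
SELECT * FROM t;
```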
The issue lies in how the column vector stores DecimalType values: it incorrectly assumes they fit in a long and tries to write them to the associated long array.
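To see why "wide enough that it doesn't fit in a long" matters: Long.MAX_VALUE is 9223372036854775807, which has 19 digits, so every unscaled decimal value of up to 18 digits fits in a long, while wider values may overflow. A minimal sketch of that width check (the 18-digit constant mirrors Spark's Decimal.MAX_LONG_DIGITS; the class and method names here are made up for illustration):

```java
import java.math.BigInteger;

public class DecimalWidth {
    // Mirrors Spark's Decimal.MAX_LONG_DIGITS: unscaled values of up to
    // 18 digits always fit in a 64-bit long.
    static final int MAX_LONG_DIGITS = 18;

    static boolean fitsInLong(int precision) {
        return precision <= MAX_LONG_DIGITS;
    }

    public static void main(String[] args) {
        // The unscaled value of a DECIMAL(22, 2) such as
        // 12345678901234567890.12 has 22 digits and overflows a long.
        BigInteger unscaled = new BigInteger("1234567890123456789012");
        try {
            unscaled.longValueExact();
        } catch (ArithmeticException e) {
            System.out.println("too wide for a long: " + unscaled);
        }
    }
}
```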
In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() checks whether the type is too wide for a long and initializes the backing array accordingly; isArray() returns true if the type is a byte-array-backed decimal (byteArrayDecimalType).
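The interaction can be modeled in miniature (a hedged, simplified sketch — not Spark's actual classes): allocation routes wide decimals to byte storage, so a write path that unconditionally assumes "decimal == long" dereferences a long array that was never allocated and throws a NullPointerException.

```java
import java.math.BigDecimal;

public class ToyDecimalVector {
    static final int MAX_LONG_DIGITS = 18;  // mirrors Spark's Decimal.MAX_LONG_DIGITS

    private final int precision;
    private long[] longData;    // allocated only for narrow decimals
    private byte[][] byteData;  // allocated only for wide decimals

    ToyDecimalVector(int capacity, int precision) {
        this.precision = precision;
        // Mirrors reserveInternal(): allocate the store that matches the width.
        if (isArray()) {
            byteData = new byte[capacity][];
        } else {
            longData = new long[capacity];
        }
    }

    // Mirrors isArray(): wide decimals are byte-array-backed.
    boolean isArray() {
        return precision > MAX_LONG_DIGITS;
    }

    // Buggy write path: assumes every decimal fits in a long. For a wide
    // decimal, longData was never allocated, so the store throws an NPE.
    void putLongUnconditionally(int row, BigDecimal v) {
        longData[row] = v.unscaledValue().longValue();
    }

    // Fixed write path: route by width.
    void put(int row, BigDecimal v) {
        if (isArray()) {
            byteData[row] = v.unscaledValue().toByteArray();
        } else {
            longData[row] = v.unscaledValue().longValueExact();
        }
    }

    public static void main(String[] args) {
        ToyDecimalVector wide = new ToyDecimalVector(4, 30);  // e.g. DECIMAL(30, 2)
        BigDecimal dflt = new BigDecimal("12345678901234567890.12");
        try {
            wide.putLongUnconditionally(0, dflt);
        } catch (NullPointerException e) {
            System.out.println("buggy path: NPE, long array was never allocated");
        }
        wide.put(0, dflt);  // fixed path stores the unscaled bytes
        System.out.println("fixed path: ok");
    }
}
```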
Without the fix:

Cause: java.lang.NullPointerException:
  at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
  at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
  at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:95)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
  at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891)
Fix PR: https://github.com/apache/spark/pull/43960