[SPARK-39926] Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: SQL
Labels: None

Description

How to reproduce:

set spark.sql.parquet.enableVectorizedReader=false;
create table t(a int) using parquet;
insert into t values (42);
alter table t add column b int default 42;
insert into t values (43, null);
select * from t;

This should return two rows:

(42, 42) and (43, NULL)

Instead, the scan misses the inserted NULL value and returns the existence DEFAULT value of 42 in its place:

(42, 42) and (43, 42)

This bug happens because the Parquet API calls one of these set* methods in ParquetRowConverter.scala whenever it finds a non-NULL value:

private class RowUpdater(row: InternalRow, ordinal: Int)
    extends ParentContainerUpdater {
  override def set(value: Any): Unit = row(ordinal) = value
  override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, value)
  override def setByte(value: Byte): Unit = row.setByte(ordinal, value)
  override def setShort(value: Short): Unit = row.setShort(ordinal, value)
  override def setInt(value: Int): Unit = row.setInt(ordinal, value)
  override def setLong(value: Long): Unit = row.setLong(ordinal, value)
  override def setDouble(value: Double): Unit = row.setDouble(ordinal, value)
  override def setFloat(value: Float): Unit = row.setFloat(ordinal, value)
}

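However, it never calls anything like "setNull()" when it encounters a NULL value, so a NULL stored in the file leaves the row slot untouched and the scan returns the existence DEFAULT value for that column.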
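The effect can be modeled without Spark. The following is only a sketch under stated assumptions: MissedNullDemo, SlotUpdater and readRecord are hypothetical names, not Spark or Parquet APIs, and the sketch assumes each row slot is pre-filled with its existence DEFAULT value before the per-value callbacks run.

```scala
// Simplified stand-in for the converter; none of these names are Spark APIs.
object MissedNullDemo {
  // Mirrors the shape of RowUpdater: it is only invoked for non-NULL values.
  final class SlotUpdater(row: Array[Any], ordinal: Int) {
    def set(value: Any): Unit = row(ordinal) = value
  }

  // Reads one record. `stored` holds the columns physically present in the
  // file, with None standing for a stored NULL.
  def readRecord(stored: Seq[Option[Any]], existenceDefaults: Array[Any]): List[Any] = {
    // Assumed buggy step: every slot starts out holding its existence DEFAULT.
    val row: Array[Any] = existenceDefaults.clone()
    stored.zipWithIndex.foreach { case (value, ordinal) =>
      // A stored NULL triggers no callback at all, so the default survives.
      value.foreach(v => new SlotUpdater(row, ordinal).set(v))
    }
    row.toList
  }

  def main(args: Array[String]): Unit = {
    val defaults = Array[Any](null, 42)
    // Old file, written before column b existed: only column a is stored.
    println(readRecord(Seq(Some(42)), defaults))       // List(42, 42)
    // New file: column b is stored explicitly as NULL.
    println(readRecord(Seq(Some(43), None), defaults)) // List(43, 42)
  }
}
```

The second record reproduces the wrong (43, 42) result: nothing ever overwrites the pre-filled default for column b.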

To fix the bug, we need to know how many columns of data were present in each row of the Parquet data, so we can differentiate between a NULL value and a missing column.
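One way to picture this idea is the sketch below. It is not the change made in the linked pull request: MissingVsNullDemo and readRecord are hypothetical names, and the sketch assumes that columns added by ALTER TABLE are appended after the existing ones, so the number of columns stored in the file is enough to tell which slots are missing.

```scala
// Illustrative model only; none of these names are Spark APIs.
object MissingVsNullDemo {
  // `stored` holds the columns physically present in the file, with None
  // standing for a stored NULL. Columns past stored.length are missing.
  def readRecord(stored: Seq[Option[Any]], existenceDefaults: Array[Any]): List[Any] = {
    // Every slot starts as null, so a stored NULL needs no callback.
    val row = new Array[Any](existenceDefaults.length)
    stored.zipWithIndex.foreach { case (value, ordinal) =>
      value.foreach(v => row(ordinal) = v)
    }
    // Only columns absent from the file receive their existence DEFAULT.
    var ordinal = stored.length
    while (ordinal < row.length) {
      row(ordinal) = existenceDefaults(ordinal)
      ordinal += 1
    }
    row.toList
  }

  def main(args: Array[String]): Unit = {
    val defaults = Array[Any](null, 42)
    println(readRecord(Seq(Some(42)), defaults))       // List(42, 42)
    println(readRecord(Seq(Some(43), None), defaults)) // List(43, null)
  }
}
```

With the column count known, the old file still picks up the default for b, while the explicitly stored NULL is preserved.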

Issue Links

links to

[Github] Pull Request #37501 (dtenedor)

People

Assignee: Daniel

Reporter: Daniel

Votes: 0

Watchers: 3

Dates

Created: 29/Jul/22 21:49

Updated: 13/Aug/22 17:47

Resolved: 13/Aug/22 17:47