Description
How to reproduce:
set spark.sql.parquet.enableVectorizedReader=false;
create table t(a int) using parquet;
insert into t values (42);
alter table t add column b int default 42;
insert into t values (43, null);
select * from t;
This should return two rows:
(42, 42) and (43, NULL)
But the scan misses the inserted NULL value and returns the existence DEFAULT value of "42" instead:
(42, 42) and (43, 42).
This bug happens because the Parquet API calls one of these set* methods in ParquetRowConverter.scala whenever it finds a non-NULL value:
private class RowUpdater(row: InternalRow, ordinal: Int) extends ParentContainerUpdater {
  override def set(value: Any): Unit = row(ordinal) = value
  override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, value)
  override def setByte(value: Byte): Unit = row.setByte(ordinal, value)
  override def setShort(value: Short): Unit = row.setShort(ordinal, value)
  override def setInt(value: Int): Unit = row.setInt(ordinal, value)
  override def setLong(value: Long): Unit = row.setLong(ordinal, value)
  override def setDouble(value: Double): Unit = row.setDouble(ordinal, value)
  override def setFloat(value: Float): Unit = row.setFloat(ordinal, value)
}
But it never calls anything like "setNull()" when encountering a NULL value.
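This matches the shape of parquet-mr's converter API: org.apache.parquet.io.api.PrimitiveConverter only exposes add* callbacks for concrete values, with no callback for NULLs. A minimal illustration (CountingIntConverter is a hypothetical converter written for this report, not Spark code):

import org.apache.parquet.io.api.PrimitiveConverter

// A trivial converter for an INT32 column: parquet-mr invokes addInt() once per
// non-NULL value it reads. For a NULL value it makes no call at all, so the
// converter never observes the NULL directly.
class CountingIntConverter extends PrimitiveConverter {
  var nonNullValuesSeen: Int = 0
  override def addInt(value: Int): Unit = {
    nonNullValuesSeen += 1
  }
}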
To fix the bug, we need to know how many columns of data were present in each row of the Parquet data, so we can differentiate between a NULL value and a missing column.
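One possible direction is sketched below with simplified stand-in classes (SimpleRow, TrackingRowUpdater, and ExistenceDefaultSketch are hypothetical and stand in for Spark's internals; this is not the actual fix): record which ordinals the Parquet reader populated while assembling each record, then apply the existence DEFAULT only to columns that the file does not contain at all, and explicitly null out columns the file contains but that received no callback.

import scala.collection.mutable

// Hypothetical stand-in for InternalRow, just enough for this sketch.
final class SimpleRow(numFields: Int) {
  private val values = Array.fill[Any](numFields)(null)
  def setInt(ordinal: Int, v: Int): Unit = values(ordinal) = v
  def setNullAt(ordinal: Int): Unit = values(ordinal) = null
  override def toString: String = values.mkString("(", ", ", ")")
}

// Hypothetical updater that also records which ordinals the Parquet reader populated.
final class TrackingRowUpdater(row: SimpleRow, populated: mutable.BitSet) {
  def setInt(ordinal: Int, value: Int): Unit = {
    populated += ordinal      // the file supplied a value for this column
    row.setInt(ordinal, value)
  }
  // There is still no setNull(): a NULL value simply leaves the bit unset.
}

object ExistenceDefaultSketch {
  def main(args: Array[String]): Unit = {
    val tableFieldCount = 2                      // table schema: a INT, b INT DEFAULT 42
    val existenceDefaults = Array[Any](null, 42) // per-column existence DEFAULT values

    // Simulate reading the row (43, NULL) from a file whose schema contains both columns.
    val fileFieldCount = 2
    val row = new SimpleRow(tableFieldCount)
    val populated = mutable.BitSet.empty
    val updater = new TrackingRowUpdater(row, populated)
    updater.setInt(0, 43)                        // column a = 43; column b is NULL, so no call

    // End of record: fill in each unpopulated column. Only columns that are
    // entirely missing from the file get the existence DEFAULT; columns that
    // the file contains but that received no callback must stay NULL.
    for (ordinal <- 0 until tableFieldCount if !populated.contains(ordinal)) {
      if (ordinal >= fileFieldCount) {
        row.setInt(ordinal, existenceDefaults(ordinal).asInstanceOf[Int])
      } else {
        row.setNullAt(ordinal)
      }
    }

    println(row) // prints (43, null) rather than (43, 42)
  }
}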