Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Versions: 2.2.0, 2.2.2, 2.3.1
Description
In Spark 2.3.1, the query below silently returns wrong data.
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")

scala> sql("select * from t where id > 0").show
+---+
| ID|
+---+
+---+
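For reference, nine rows should match: spark.range(10) writes the values 0 through 9, so the correct output of the query above is:

+---+
| ID|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+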
Root Cause
After a deep dive, we found two issues, both related to different letter cases between the Hive metastore schema and the Parquet schema.
1. The wrong column is pushed down (see the first sketch after this list).
Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down to Parquet, but the column ID does not exist in /tmp/data (Parquet is case sensitive; the column is actually named id), so no records are returned.
Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to build the pushed-down filter, which fixes this issue.
2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema differ in letter case, even when spark.sql.caseSensitive is set to false (see the second sketch after this list).
SPARK-25132 already addressed this issue.
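First sketch: the predicate quoted in issue 1, built directly against Parquet's FilterApi, to show why it matches nothing. The predicate itself is from this ticket; the surrounding code is only illustrative.

import org.apache.parquet.filter2.predicate.FilterApi

// The predicate Spark builds from the Hive metastore schema references "ID",
// but the footers under /tmp/data only contain a column named "id".
val pushed = FilterApi.gt(FilterApi.intColumn("ID"), Integer.valueOf(0))

// Parquet matches column names case sensitively, so "ID" matches no column,
// every row group is pruned, and the scan returns zero records.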
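Second sketch: the NULL-column symptom can be reproduced without Hive at all, by reading the files with a requested schema that differs from the physical one only in letter case (assuming the files written by the reproduction above).

// Physical Parquet column is "id"; the requested schema asks for "ID".
val df = spark.read.schema("ID LONG").parquet("/tmp/data")

// Before SPARK-25132, every value comes back NULL even with
// spark.sql.caseSensitive=false; with the fix, the values 0..9 are returned.
df.show()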
The biggest difference is that in Spark 2.1, the user gets an exception for the same query:
Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!
so they will notice the problem and fix the query. In Spark 2.3, however, the user silently gets wrong results.
To make the above query work, we need both SPARK-25132 and SPARK-24716.
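Until both fixes are available, one possible workaround (my assumption, not something verified in this ticket) is to declare the table schema with the same letter case as the physical Parquet columns, so that both the pushed-down filter and the reader schema match:

sql("DROP TABLE IF EXISTS t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/data'")
sql("SELECT * FROM t WHERE id > 0").show()   // returns the rows 1..9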
yumwang, cloud_fan, smilegator, any thoughts? Should we backport it?
Attachments
Issue Links
- relates to
  - SPARK-24716 Refactor ParquetFilters (Resolved)
  - SPARK-25207 Case-insensitive field resolution for filter pushdown when reading Parquet (Resolved)
  - SPARK-25132 Case-insensitive field resolution when reading from Parquet (Resolved)