In current Spark 2.3.1, the query below silently returns wrong data.
After a deep dive, we found two issues, both related to letter-case differences between the Hive metastore schema and the Parquet schema.
1. The wrong column is pushed down.
Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down to Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; the column is actually named id).
So no records are returned.
With SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue.
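The effect of issue 1 can be illustrated with a toy simulation (plain Python, not Spark or parquet-mr code; the row data and helper function are made up for illustration). A filter pushed down on the metastore-cased name "ID" matches no column in a case-sensitive Parquet file whose column is named id, so every record is filtered out:

```python
# Toy rows standing in for a Parquet file whose physical column name is "id".
parquet_rows = [{"id": 1}, {"id": 2}]

def eval_gt(rows, column, value):
    # Case-sensitive column lookup, like Parquet: a filter on a column name
    # that does not exist in the file matches no records at all.
    return [r for r in rows if column in r and r[column] > value]

print(eval_gt(parquet_rows, "ID", 0))  # pushed-down metastore name: no records
print(eval_gt(parquet_rows, "id", 0))  # physical Parquet name: both records
```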
2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema differ in letter case, even when spark.sql.caseSensitive is set to false.
SPARK-25132 has already addressed this issue.
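A minimal sketch of the kind of field resolution involved here (hypothetical helper, not Spark's actual implementation): when case sensitivity is off, the requested column is matched against the physical Parquet field names case-insensitively, and an ambiguous match is reported as an error rather than silently resolved:

```python
def resolve_field(requested, parquet_fields, case_sensitive=False):
    # Strict mode: only an exact-case name counts as a match.
    if case_sensitive:
        return requested if requested in parquet_fields else None
    # Case-insensitive mode: compare lowercased names, and refuse to guess
    # if the Parquet schema contains more than one candidate.
    matches = [f for f in parquet_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        raise ValueError(f"Found duplicate field(s) {matches} in case-insensitive mode")
    return matches[0] if matches else None

print(resolve_field("ID", ["id", "name"]))        # case-insensitive match: "id"
print(resolve_field("ID", ["id", "name"], True))  # strict match fails: None
```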
The biggest difference is that, in Spark 2.1, the user gets an exception for the same query:
so they know about the issue and can fix the query.
But in Spark 2.3, the user silently gets the wrong results.