[SPARK-23852] Parquet MR bug can lead to incorrect SQL results - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.1, 2.4.0
Component/s: SQL
Labels:
- correctness

Description

Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that can cause certain predicates pushed down to Parquet scanners to return fewer rows than they should.

The bug triggers in Spark when:

- The Parquet file being scanned has stats for the null count, but not the max or min, on the column with the predicate (Apache Impala writes files like this).
- The vectorized Parquet reader path is not taken, and the parquet-mr reader is used.
- A suitable <, <=, > or >= predicate is pushed down to Parquet.

The bug is that parquet-mr interprets the max and min of a row group's column as 0 in the absence of stats. So a predicate such as col > 0 filters out the entire row group, even if some of its values are > 0.
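The behaviour described above can be illustrated with a small, self-contained simulation. This is only a sketch of the logic, not parquet-mr code: `ColumnStats`, `can_drop_gt_buggy`, `can_drop_gt_fixed` and `scan_gt` are hypothetical names invented for this example, and the "fixed" variant simply assumes that a row group with unknown max must never be dropped.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ColumnStats:
    """Row-group statistics for one column; min/max may be absent."""
    null_count: int
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def can_drop_gt_buggy(stats: ColumnStats, literal: int) -> bool:
    """Mimics the faulty behaviour: a missing max is read as 0."""
    max_value = stats.max_value if stats.max_value is not None else 0
    return max_value <= literal


def can_drop_gt_fixed(stats: ColumnStats, literal: int) -> bool:
    """Assumed correct behaviour: never drop when max is unknown."""
    if stats.max_value is None:
        return False
    return stats.max_value <= literal


def scan_gt(
    values: List[Optional[int]],
    stats: ColumnStats,
    literal: int,
    can_drop: Callable[[ColumnStats, int], bool],
) -> List[int]:
    """Evaluates `col > literal` on one row group with stats-based skipping."""
    if can_drop(stats, literal):
        return []  # the whole row group is skipped without reading values
    return [v for v in values if v is not None and v > literal]


row_group = [5, None, 12, -3]
# Null count only, no min/max: the shape of stats the report attributes to Impala.
impala_style_stats = ColumnStats(null_count=1)

buggy_rows = scan_gt(row_group, impala_style_stats, 0, can_drop_gt_buggy)
fixed_rows = scan_gt(row_group, impala_style_stats, 0, can_drop_gt_fixed)
```

With the missing max treated as 0, the `col > 0` scan skips the row group and `buggy_rows` is empty, while `fixed_rows` contains the two matching values.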

There is no upstream release of Parquet that contains the fix for PARQUET-1217, although a 1.10 release is planned.

The least impactful workaround is to set the Parquet configuration parquet.filter.stats.enabled to false.
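One way the workaround might be applied from PySpark is sketched below. It assumes that the setting reaches parquet-mr through the Hadoop configuration, which Spark populates from any conf entry carrying the `spark.hadoop.` prefix; whether that is sufficient in a given deployment should be verified. The helper names `workaround_conf` and `build_session` are hypothetical and not part of Spark.

```python
from typing import Dict

# Parquet configuration key named in this issue.
PARQUET_STATS_FILTER_KEY = "parquet.filter.stats.enabled"


def workaround_conf() -> Dict[str, str]:
    """Spark conf entries that disable Parquet statistics filtering.

    The "spark.hadoop." prefix asks Spark to copy the entry into the
    Hadoop configuration (assumed here to be where parquet-mr reads it).
    """
    return {"spark.hadoop." + PARQUET_STATS_FILTER_KEY: "false"}


def build_session(app_name: str = "spark-23852-workaround"):
    """Builds a SparkSession with the workaround applied (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: optional dependency

    builder = SparkSession.builder.appName(app_name)
    for key, value in workaround_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session()` before any Parquet reads keeps the setting in place for the whole session; the same key/value pair could also be passed with `--conf` on the `spark-submit` command line.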

Issue Links

is caused by:
- PARQUET-1217 Incorrect handling of missing values in Statistics (Resolved)

links to:
- [Github] Pull Request #21284 (henryr)
- [Github] Pull Request #21302 (henryr)
- [Github] Pull Request #21323 (henryr)

People

Assignee: Ryan Blue
Reporter: Henry Robinson
Votes: 0
Watchers: 6

Dates

Created: 02/Apr/18 20:41
Updated: 20/Sep/18 17:51
Resolved: 10/May/18 02:56