[PARQUET-389] Filter predicates should work with missing columns - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0, 1.7.0, 1.8.0
Fix Version/s: 1.9.0, 1.8.2
Component/s: parquet-mr
Labels: None

Description

This issue originates from SPARK-11103, which contains detailed steps to reproduce it.

The main problem is that pushed-down filter predicates assert that every column they reference exists in the target physical file. This does not hold when schema merging is enabled, because an individual file may lack columns that are present in the merged schema.

This assertion is unnecessary. If a column is missing from a file's schema, the column can be treated as filled with nulls, and every filter should behave accordingly. For example, if we push down a = 1 but a is missing in the underlying physical file, all records in that file should be dropped, since a is always null. Conversely, if we push down a IS NULL, all records should be kept.
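The expected behavior can be sketched in plain Java. This is only an illustration of the null semantics described above, not parquet-mr's implementation or API: the class MissingColumnSketch and its helpers eq, isNull, countKept and fileWithoutColumnA are hypothetical names, and a record is modeled as a map in which a column absent from the file simply has no entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from parquet-mr.
final class MissingColumnSketch {

    // "column = expected": a null (or missing) value never matches.
    static Predicate<Map<String, Object>> eq(String column, Object expected) {
        return record -> {
            Object value = record.get(column); // null when the column is missing
            return value != null && value.equals(expected);
        };
    }

    // "column IS NULL": a missing column counts as null, so it always matches.
    static Predicate<Map<String, Object>> isNull(String column) {
        return record -> record.get(column) == null;
    }

    // Counts the records that a filter keeps.
    static int countKept(List<Map<String, Object>> records,
                         Predicate<Map<String, Object>> filter) {
        int kept = 0;
        for (Map<String, Object> record : records) {
            if (filter.test(record)) {
                kept++;
            }
        }
        return kept;
    }

    // Three records from a "file" that has column b but no column a.
    static List<Map<String, Object>> fileWithoutColumnA() {
        List<Map<String, Object>> records = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            Map<String, Object> record = new HashMap<>();
            record.put("b", i);
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = fileWithoutColumnA();
        System.out.println(countKept(records, eq("a", 1)));  // prints 0
        System.out.println(countKept(records, isNull("a"))); // prints 3
    }
}
```

In this model no assertion about column existence is needed: the a = 1 filter drops every record of the file, and the a IS NULL filter keeps every record, which matches the two cases in the description.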

Issue Links

is related to
SPARK-18539 Cannot filter by nonexisting column in parquet file (Resolved)

relates to
SPARK-11103 Parquet filters push-down may cause exception when schema merging is turned on (Resolved)
SPARK-20364 Parquet predicate pushdown on columns with dots return empty results (Resolved)

links to
PR #354

People

Assignee: Ryan Blue
Reporter: Cheng Lian
Votes: 1
Watchers: 4

Dates

Created: 28/Oct/15 08:07
Updated: 21/Apr/18 12:38
Resolved: 15/Jul/16 16:54