Description
Due to PARQUET-251, BINARY columns in existing Parquet files may have been written with corrupted statistics. These statistics are used by the filter push-down optimization, and since Spark 1.5 turns on Parquet filter push-down by default, queries may silently return wrong results. PARQUET-251 has been fixed in parquet-mr 1.8.1, but Spark 1.5 still uses 1.7.0.
Note that corrupted Parquet files of this kind can be produced by any Parquet data model, not only by Spark SQL.
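To illustrate the class of bug behind PARQUET-251 (see also SPARK-6859, "statistics error when reuse byte[] among rows"), here is a minimal sketch in Python rather than the actual parquet-mr Java code: a statistics tracker that keeps a reference to a mutable, reused row buffer instead of copying its bytes, so the recorded min/max silently mutate as later rows are written. `NaiveBinaryStats` is a hypothetical name for illustration.

```python
class NaiveBinaryStats:
    """Hypothetical stand-in for a Parquet BINARY column statistics tracker."""

    def __init__(self):
        self.min = None
        self.max = None

    def update(self, value: bytearray):
        # BUG: stores the mutable buffer itself instead of a copy of its bytes.
        if self.min is None or bytes(value) < bytes(self.min):
            self.min = value
        if self.max is None or bytes(value) > bytes(self.max):
            self.max = value


buf = bytearray(3)
stats = NaiveBinaryStats()
for row in [b"aaa", b"zzz", b"mmm"]:
    buf[:] = row        # the writer reuses one buffer for every row
    stats.update(buf)

# min and max both point at the same buffer, whose content is whatever
# the *last* row happened to be -- the recorded statistics are wrong.
print(bytes(stats.min), bytes(stats.max))  # b'mmm' b'mmm'
```

With such statistics in the file footer, a pushed-down filter like `col = "aaa"` can skip a row group that actually contains matching rows, which is exactly the wrong-results scenario described above.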
This affects all Spark SQL data types that map to Parquet BINARY, namely:
- StringType
- BinaryType
- DecimalType (although Spark SQL doesn't currently support pushing down DecimalType columns anyway)
To avoid wrong query results, we should disable filter push-down for StringType and BinaryType columns until we upgrade to parquet-mr 1.8.1.
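In the meantime, affected users can fall back to a coarser workaround than the per-type fix proposed here: disabling Parquet filter push-down entirely through the existing `spark.sql.parquet.filterPushdown` configuration (on by default since Spark 1.5). A minimal config fragment:

```sql
-- Coarse workaround: turn off Parquet filter push-down for the session,
-- so corrupted BINARY statistics can no longer skip row groups.
SET spark.sql.parquet.filterPushdown=false;
```

This trades away the push-down optimization for all columns, not just the binary-backed ones, so it is only a stopgap until the parquet-mr upgrade lands.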
Issue Links
- is related to SPARK-9876 Upgrade parquet-mr to 1.8.1 (Resolved)
- relates to SPARK-11784 Support Timestamp filter pushdown in Parquet datasource (Resolved)
- relates to SPARK-6859 Parquet File Binary column statistics error when reuse byte[] among rows (Resolved)