Description
Suppose I create a dataRDD that extends RDD[Row], where each row is a GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused across rows but holds different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max & min) for the Binary column are wrong.
Here is the reason: in Parquet, BinaryStatistics keeps max & min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row.
max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]
Therefore, each time Parquet updates the row group's statistics, max & min always refer to the same Array[Byte], whose content changes each time. When Parquet finally writes the statistics to the file, the last row's content is saved as both max & min.
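The aliasing can be reproduced without Parquet at all. Here is a minimal, self-contained Scala sketch (RefStats, run, and the reused buffer are hypothetical names for illustration, not Parquet's actual classes) of a statistics holder that stores min/max by reference; once the caller mutates the shared buffer, both statistics silently follow it:

```scala
object StatsAliasingDemo {
  // Hypothetical stand-in for Parquet's BinaryStatistics: it keeps the
  // incoming value by reference instead of copying its bytes.
  class RefStats {
    var min: Array[Byte] = _
    var max: Array[Byte] = _
    def update(v: Array[Byte]): Unit = {
      if (min == null || compare(v, min) < 0) min = v // stores the reference
      if (max == null || compare(v, max) > 0) max = v // stores the reference
    }
    // Unsigned-free lexicographic compare, enough for this demo.
    private def compare(a: Array[Byte], b: Array[Byte]): Int = {
      val n = math.min(a.length, b.length)
      var i = 0
      while (i < n) {
        val c = java.lang.Byte.compare(a(i), b(i))
        if (c != 0) return c
        i += 1
      }
      a.length - b.length
    }
  }

  // Feed three "rows" through one reused buffer, as the mutable row does,
  // and report (min, max) as observed after the last row.
  def run(): (Byte, Byte) = {
    val stats = new RefStats
    val buf = new Array[Byte](1) // single buffer reused across rows
    for (b <- Seq[Byte](5, 1, 9)) {
      buf(0) = b
      stats.update(buf) // min and max now alias buf
    }
    (stats.min(0), stats.max(0)) // both reflect only the last row's content
  }

  def main(args: Array[String]): Unit = {
    val (mn, mx) = run()
    println(s"min=$mn max=$mx") // true min is 1, but both report 9
  }
}
```

The sketch reports min=9 and max=9 even though the true minimum is 1, which matches the wrong row-group statistics described above. A fix would be for the statistics holder to copy the bytes on update, or for the writer to pass a fresh array per row.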
This looks like a Parquet bug, since it is Parquet's responsibility to update statistics correctly, but I am not quite sure. Should I report it in the Parquet JIRA?
Issue Links
- is related to
  - SPARK-11153 Turns off Parquet filter push-down for string and binary columns (Resolved)
- relates to
  - SPARK-11784 Support Timestamp filter pushdown in Parquet datasource (Resolved)
  - SPARK-9876 Upgrade parquet-mr to 1.8.1 (Resolved)