Description
Suppose I create a dataRDD that extends RDD[Row], where each row is a GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused across rows but holds different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max & min) for the Binary column are wrong.
Here is the reason: in Parquet, BinaryStatistics keeps max & min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row.
max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array[Byte]
Therefore, each time Parquet updates the row group's statistics, max & min always refer to the same Array[Byte], whose content changes each time. When Parquet finally writes the statistics to the file, the last row's content is saved as both max & min.
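The aliasing can be reproduced without Parquet at all. Here is a minimal, self-contained Scala sketch (RefStats, run, and the reused buffer are hypothetical names for illustration, not Parquet's actual classes) of a statistics holder that stores min/max by reference; once the caller mutates the shared buffer, both statistics silently follow it:

```scala
object StatsAliasingDemo {
  // Hypothetical stand-in for Parquet's BinaryStatistics: it keeps the
  // incoming value by reference instead of copying its bytes.
  class RefStats {
    var min: Array[Byte] = _
    var max: Array[Byte] = _
    def update(v: Array[Byte]): Unit = {
      if (min == null || compare(v, min) < 0) min = v // stores the reference
      if (max == null || compare(v, max) > 0) max = v // stores the reference
    }
    // Unsigned-free lexicographic compare, enough for this demo.
    private def compare(a: Array[Byte], b: Array[Byte]): Int = {
      val n = math.min(a.length, b.length)
      var i = 0
      while (i < n) {
        val c = java.lang.Byte.compare(a(i), b(i))
        if (c != 0) return c
        i += 1
      }
      a.length - b.length
    }
  }

  // Feed three "rows" through one reused buffer, as the mutable row does,
  // and report (min, max) as observed after the last row.
  def run(): (Byte, Byte) = {
    val stats = new RefStats
    val buf = new Array[Byte](1) // single buffer reused across rows
    for (b <- Seq[Byte](5, 1, 9)) {
      buf(0) = b
      stats.update(buf) // min and max now alias buf
    }
    (stats.min(0), stats.max(0)) // both reflect only the last row's content
  }

  def main(args: Array[String]): Unit = {
    val (mn, mx) = run()
    println(s"min=$mn max=$mx") // true min is 1, but both report 9
  }
}
```

The sketch reports min=9 and max=9 even though the true minimum is 1, which matches the wrong row-group statistics described above. A fix would be for the statistics holder to copy the bytes on update, or for the writer to pass a fresh array per row.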
This looks like a Parquet bug, since it is Parquet's responsibility to update statistics correctly, but I am not quite sure. Should I report it in the Parquet JIRA?
Issue Links
- is related to
  - SPARK-11153 Turns off Parquet filter push-down for string and binary columns (Resolved)
- relates to
  - SPARK-11784 Support Timestamp filter pushdown in Parquet datasource (Resolved)
  - SPARK-9876 Upgrade parquet-mr to 1.8.1 (Resolved)