Spark / SPARK-6859

Parquet File Binary column statistics error when reuse byte[] among rows


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 1.3.0, 1.4.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Suppose I create a dataRDD that extends RDD[Row], where each row is a GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused across rows, but its content differs each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row-group statistics (max & min) for the Binary column are wrong.



      Here is the reason: in Parquet, BinaryStatistics keeps max & min as parquet.io.api.Binary references, and Spark SQL generates a new Binary backed by the same Array[Byte] passed in from the row.

                 reference                            backed by
      max: Binary ----------> ByteArrayBackedBinary ----------> Array[Byte]

      Therefore, each time Parquet updates the row group's statistics, max & min still refer to the same Array[Byte], whose content has changed in the meantime. When Parquet finally writes the statistics to the file, the last row's content is saved as both max and min.
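      The aliasing described above can be demonstrated with a minimal Java sketch. BuggyBinaryStats is a hypothetical class, not the real Parquet Statistics API; it mimics the problematic behavior of storing a reference to the caller's buffer instead of a copy:

```java
// Hypothetical sketch of the aliasing bug (not the real Parquet API):
// statistics that keep a *reference* to the caller's reused buffer.
class BuggyBinaryStats {
    byte[] min, max;

    void update(byte[] value) {
        // The comparison is correct, but the reference is stored as-is.
        if (min == null || compare(value, min) < 0) min = value;
        if (max == null || compare(value, max) > 0) max = value;
    }

    // Unsigned lexicographic byte comparison.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}

public class Main {
    public static void main(String[] args) {
        byte[] buf = new byte[1];               // one buffer reused for every row
        BuggyBinaryStats stats = new BuggyBinaryStats();
        for (byte v : new byte[]{5, 1, 9, 3}) {
            buf[0] = v;                         // content overwritten in place
            stats.update(buf);
        }
        // min and max both alias buf, so both report the LAST row's value (3)
        // instead of the true min (1) and max (9).
        System.out.println(stats.min[0] + " " + stats.max[0]);  // prints "3 3"
    }
}
```

      Defensively copying the byte array before storing it as min/max (which is what the eventual fix on the Parquet side amounts to) makes the statistics independent of the reused buffer.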



      This looks like a Parquet bug, since it is Parquet's responsibility to update statistics correctly, but I'm not quite sure. Should I report it as a bug in the Parquet JIRA?

            People

              Assignee: rdblue Ryan Blue
              Reporter: yijieshen Yijie Shen
              Votes: 1
              Watchers: 6
