  1. Parquet
  2. PARQUET-826

parquet.thrift comments for Statistics are not consistent with parquet-mr and Hive implementations

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.9.0
    • parquet-format
    • None

    Description

      I'm currently working on adding support to Impala for writing min/max statistics to Parquet files (IMPALA-3909). I noticed that the comments in parquet.thrift#L201 don't seem to match the implementations in parquet-mr and Hive.

      The comments ask for min/max statistics to be "encoded in PLAIN encoding". For strings (BYTE_ARRAY), PLAIN encoding is defined as "4 byte length stored as little endian, followed by bytes", so the statistics would be expected to carry that length prefix.

      Looking at BinaryStatistics.java#L61, it seems to return the bytes without a length-prefix. Writing a parquet file with Hive also shows this behavior.
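
      To make the mismatch concrete, here is a minimal Python sketch of the two byte layouts for a BYTE_ARRAY min value. The function names are hypothetical and are not taken from parquet-mr, Hive, or Impala; the sketch only contrasts the length-prefixed form described in the comment with the unprefixed form that the implementations appear to write.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """Layout the thrift comment implies: 4-byte little-endian length, then the bytes."""
    return struct.pack("<I", len(value)) + value


def unprefixed_byte_array(value: bytes) -> bytes:
    """Layout the implementations appear to write: the bytes alone, no length prefix."""
    return value


min_value = b"abc"
prefixed = plain_encode_byte_array(min_value)
unprefixed = unprefixed_byte_array(min_value)

print(prefixed.hex())    # 03000000616263
print(unprefixed.hex())  # 616263
```

      A reader that expects one layout and is given the other would misinterpret the first four bytes, which is why the comment and the implementations need to agree.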

      Similarly, but less ambiguously, PLAIN encoding for booleans uses bit-packing. It seems to be implied that for a single bit (the min/max of a boolean column) this means setting the least significant bit of a single byte. This could be made clearer in the parquet.thrift file, too.
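
      A minimal Python sketch of the boolean interpretation described above, assuming the single min or max value occupies the least significant bit of one byte. The helper names are hypothetical and do not come from any Parquet implementation.

```python
def encode_boolean_stat(value: bool) -> bytes:
    """Store one boolean in the least significant bit of a single byte."""
    return bytes([1 if value else 0])


def decode_boolean_stat(data: bytes) -> bool:
    """Read the boolean back from the least significant bit of the first byte."""
    return bool(data[0] & 1)


print(encode_boolean_stat(True).hex())   # 01
print(encode_boolean_stat(False).hex())  # 00
```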

          People

            Lars Volker