PARQUET-826

parquet.thrift comments for Statistics are not consistent with parquet-mr and Hive implementations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9.0
    • Component/s: parquet-format
    • Labels:
      None

      Description

      I'm currently working on adding support to Impala for writing min/max statistics to Parquet files (IMPALA-3909). I noticed that the comments in parquet.thrift#L201 don't seem to match the implementations in parquet-mr and Hive.

      The comments ask for min/max statistics to be "encoded in PLAIN encoding". For strings (BYTE_ARRAY), PLAIN encoding means "4 byte length stored as little endian, followed by bytes".

      However, BinaryStatistics.java#L61 seems to return the bytes without a length prefix. Writing a Parquet file with Hive shows the same behavior.
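To make the mismatch concrete, here is a minimal Python sketch of the two byte layouts for a string min/max value. The function names are hypothetical and only illustrate the layouts described above; they are not the parquet-mr or Hive code.

```python
import struct


def plain_encode_byte_array(value: bytes) -> bytes:
    """Layout the parquet.thrift comment asks for: PLAIN-encoded BYTE_ARRAY,
    i.e. a 4-byte little-endian length followed by the bytes."""
    return struct.pack("<I", len(value)) + value


def raw_bytes(value: bytes) -> bytes:
    """Layout the implementations appear to write: the bytes alone,
    without a length prefix (hypothetical stand-in, not the real code)."""
    return value


value = "abc".encode("utf-8")
print(plain_encode_byte_array(value).hex())  # 03000000616263
print(raw_bytes(value).hex())                # 616263
```

A reader expecting the length-prefixed form would misinterpret the first four bytes of the raw form as a length, so the two layouts are not interchangeable.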

      Similarly, but less ambiguously, PLAIN encoding for booleans uses bit-packing. It seems to be implied that for a single bit (the min or max of a boolean column) this means setting the least significant bit of a single byte. This could be made clearer in the parquet.thrift file, too.
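The boolean interpretation described above could be sketched in Python as follows. This assumes the reading that seems implied (one byte, value in the least significant bit); the function names are hypothetical and this is not taken from any implementation.

```python
def encode_boolean_stat(value: bool) -> bytes:
    """Encode a single boolean min/max as one byte, with the value
    stored in the least significant bit (assumed interpretation)."""
    return bytes([1 if value else 0])


def decode_boolean_stat(data: bytes) -> bool:
    """Read the boolean back from the least significant bit of the first byte."""
    return bool(data[0] & 0x01)


print(encode_boolean_stat(True).hex())   # 01
print(encode_boolean_stat(False).hex())  # 00
```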

            People

            • Assignee:
              Lars Volker
            • Reporter:
              Lars Volker