PARQUET-839

Min-max should be computed based on logical type

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: format-2.3.1
    • Fix Version/s: None
    • Component/s: parquet-format
    • Labels:
      None

      Description

      The min/max stats are currently underspecified: the spec does not make clear what ordering is expected.

      There are some related issues, such as PARQUET-686, that fix specific problems, but there seems to be a general assumption that min/max should be defined based on the primitive type rather than the logical type.

      However, this makes the stats nearly useless for some logical types. For example, consider a DECIMAL encoded as a variable-length BINARY. The min/max of the underlying binary type is based on the lexicographic order of the byte string, which does not correspond to any reasonable ordering of the decimal values: 256 (0x01 0x00) is ordered between 1 (0x01) and 2 (0x02). This makes min/max filtering much less effective and forces query engines that use Parquet to implement workarounds, such as custom comparators, to produce correct results.
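
      The mismatch can be reproduced with a minimal sketch, assuming the unscaled DECIMAL value is stored as a minimal-length big-endian two's-complement byte string and that the writer compares byte strings lexicographically. The helpers `encode_unscaled` and `decode_unscaled` are hypothetical names used only for illustration; they are not part of any Parquet library.

```python
def encode_unscaled(value: int) -> bytes:
    """Encode an unscaled decimal value as minimal-length big-endian two's complement.

    Hypothetical helper for illustration only.
    """
    length = 1
    while True:
        try:
            return value.to_bytes(length, "big", signed=True)
        except OverflowError:
            # Value does not fit in `length` bytes; try one more byte.
            length += 1


def decode_unscaled(raw: bytes) -> int:
    """Inverse of encode_unscaled."""
    return int.from_bytes(raw, "big", signed=True)


values = [1, 2, 256]
encoded = [encode_unscaled(v) for v in values]

# Ordering by the primitive type: Python compares bytes lexicographically.
byte_order = [decode_unscaled(b) for b in sorted(encoded)]

# Ordering by the logical type: compare the decoded numeric values.
logical_order = sorted(values)

# Min/max as a writer would record them if it compared raw bytes.
stats_min = decode_unscaled(min(encoded))
stats_max = decode_unscaled(max(encoded))

print(byte_order)            # [1, 256, 2]
print(logical_order)         # [1, 2, 256]
print(stats_min, stats_max)  # 1 2
```

      Under these assumptions the byte-based statistics report a maximum of 2 even though the column contains 256, so a reader that trusts them for a predicate such as `x > 100` would wrongly skip the data unless it applies its own comparator.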

              People

              • Assignee: Unassigned
              • Reporter: Tim Armstrong (tarmstrong)
              • Votes: 0
              • Watchers: 3