PARQUET-1655

[C++] Decimal comparisons used for min/max statistics are not correct

Details

    Description

      The Parquet format specification says:

      If the column uses int32 or int64 physical types, then signed comparison of the integer values produces the correct ordering. If the physical type is fixed, then the correct ordering can be produced by flipping the most-significant bit in the first byte and then using unsigned byte-wise comparison.

      However, the C++ Parquet code does not follow this. 16-byte decimal comparison is implemented as a lexicographical comparison of signed chars, which treats every byte as signed rather than only the first.

      This appears to be because the function at https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183 selects the comparator from the sort_order (signed) and physical_type (FIXED_LENGTH_BYTE_ARRAY) alone; there is no override for decimal.
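
      The difference between the two orderings can be illustrated with a minimal standalone sketch. It is not the code in statistics.cc: the names Decimal16, FromInt64, SignedCharLess and SpecLess are hypothetical, and the sketch assumes 16-byte big-endian two's-complement values as stored in a FIXED_LENGTH_BYTE_ARRAY decimal column.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical type: a 16-byte big-endian two's-complement decimal value.
using Decimal16 = std::array<uint8_t, 16>;

// Hypothetical helper: sign-extend an int64_t into 16 big-endian bytes.
inline Decimal16 FromInt64(int64_t v) {
  Decimal16 out;
  out.fill(v < 0 ? 0xFF : 0x00);
  const uint64_t bits = static_cast<uint64_t>(v);
  for (int i = 0; i < 8; ++i) {
    out[15 - i] = static_cast<uint8_t>(bits >> (8 * i));
  }
  return out;
}

// The ordering described in this issue: lexicographic over signed chars.
// Every byte is interpreted as signed, not just the first one.
inline bool SignedCharLess(const Decimal16& a, const Decimal16& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](uint8_t x, uint8_t y) {
        return static_cast<int8_t>(x) < static_cast<int8_t>(y);
      });
}

// The ordering quoted from the specification: flip the most-significant
// bit of the first byte, then compare all bytes as unsigned.
inline bool SpecLess(const Decimal16& a, const Decimal16& b) {
  const uint8_t a0 = static_cast<uint8_t>(a[0] ^ 0x80);
  const uint8_t b0 = static_cast<uint8_t>(b[0] ^ 0x80);
  if (a0 != b0) return a0 < b0;
  for (std::size_t i = 1; i < a.size(); ++i) {
    if (a[i] != b[i]) return a[i] < b[i];
  }
  return false;
}
```

      With these helpers, SignedCharLess(FromInt64(128), FromInt64(1)) returns true, because the low byte 0x80 is read as -128, whereas SpecLess orders 1 before 128 as expected. Both comparators agree whenever the first differing byte is the first byte, which is why the problem only shows up for some value pairs.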

            People

              emkornfield Micah Kornfield
              philjdf Philip Felton
              Votes: 1
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 10m