Apache Arrow / ARROW-12264

[C++][Dataset] Handle NaNs correctly in Parquet predicate push-down


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Components: C++, Parquet

    Description

      The Parquet spec (in parquet.thrift) says the following about handling of floating-point statistics:

         * (*) Because the sorting order is not specified properly for floating
         *     point values (relations vs. total ordering) the following
         *     compatibility rules should be applied when reading statistics:
         *     - If the min is a NaN, it should be ignored.
         *     - If the max is a NaN, it should be ignored.
         *     - If the min is +0, the row group may contain -0 values as well.
         *     - If the max is -0, the row group may contain +0 values as well.
         *     - When looking for NaN values, min and max should be ignored.
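The compatibility rules quoted above can be sketched on plain `double` values as follows. This is only an illustration under stated assumptions: `UsableBounds`, `SanitizeFloatStatistics` and `MayContain` are hypothetical names and are not part of Arrow or parquet-cpp, and the real statistics are typed and optional rather than two bare doubles.

```cpp
#include <cmath>
#include <optional>

// Hypothetical holder for the bounds that remain usable after applying the
// compatibility rules. An empty optional means "no bound on that side".
struct UsableBounds {
  std::optional<double> min;
  std::optional<double> max;
};

// Apply the parquet.thrift compatibility rules to raw min/max statistics.
inline UsableBounds SanitizeFloatStatistics(double raw_min, double raw_max) {
  UsableBounds bounds;
  if (!std::isnan(raw_min)) {
    // A min of +0 may hide -0 values, so store the zero as -0.0.
    // This only matters to consumers that distinguish the sign of zero;
    // ordinary IEEE 754 comparisons already treat -0.0 and +0.0 as equal.
    bounds.min = (raw_min == 0.0) ? -0.0 : raw_min;
  }
  if (!std::isnan(raw_max)) {
    // A max of -0 may hide +0 values, so store the zero as +0.0.
    bounds.max = (raw_max == 0.0) ? 0.0 : raw_max;
  }
  return bounds;
}

// Returns true when the row group may contain `value`, i.e. when the
// statistics do not allow the row group to be skipped.
inline bool MayContain(const UsableBounds& bounds, double value) {
  // When looking for NaN, min and max must be ignored entirely.
  if (std::isnan(value)) return true;
  if (bounds.min && value < *bounds.min) return false;
  if (bounds.max && value > *bounds.max) return false;
  return true;
}
```

For example, `MayContain(SanitizeFloatStatistics(NAN, 2.0), -100.0)` is true because the NaN minimum is discarded, leaving only the upper bound.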
      

      It appears that the dataset code uses the following filter expression when doing Parquet predicate push-down (in file_parquet.cc):

          return and_(greater_equal(field_expr, literal(min)),
                      less_equal(field_expr, literal(max)));
      

      Any ordered comparison involving NaN evaluates to false, so a NaN value will fail that filter and yet may be found in the given Parquet column chunk.

      We may instead need a "greater_equal_or_nan" comparison that returns true if either value is NaN.
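To make the proposal concrete, here is a minimal scalar sketch of what a NaN-tolerant comparison could look like, next to the strict range check it would replace. It assumes plain `double` operands rather than Arrow expressions; `GreaterEqualOrNan`, `LessEqualOrNan`, `InRangeStrict` and `InRangeOrNan` are hypothetical names chosen for illustration and do not exist in Arrow's compute or dataset API.

```cpp
#include <cmath>

// Hypothetical NaN-tolerant comparison: true if either operand is NaN,
// otherwise the ordinary value >= bound.
inline bool GreaterEqualOrNan(double value, double bound) {
  return std::isnan(value) || std::isnan(bound) || value >= bound;
}

// Hypothetical counterpart for the upper bound.
inline bool LessEqualOrNan(double value, double bound) {
  return std::isnan(value) || std::isnan(bound) || value <= bound;
}

// Scalar analogue of and_(greater_equal(...), less_equal(...)).
// Any NaN operand makes both comparisons false, so NaN is always rejected.
inline bool InRangeStrict(double value, double min, double max) {
  return value >= min && value <= max;
}

// Scalar analogue of the same filter built from the NaN-tolerant variants.
// A NaN value passes, and a NaN bound no longer constrains that side.
inline bool InRangeOrNan(double value, double min, double max) {
  return GreaterEqualOrNan(value, min) && LessEqualOrNan(value, max);
}
```

With statistics `min = 1.0` and `max = 2.0`, `InRangeStrict(NAN, 1.0, 2.0)` is false while `InRangeOrNan(NAN, 1.0, 2.0)` is true, so a column chunk that may hold NaN values would no longer be pruned. Treating a NaN bound as "no constraint" also matches the rule that a NaN min or max should be ignored.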

            People

              sanjibansg Sanjiban Sengupta
              apitrou Antoine Pitrou

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m