Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-9895

[RUST] Improve sort kernels

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.0.0
    • 2.0.0
    • Rust

    Description

      Followup from my mailing list post:

      1. When sorting by multiple columns (lexsort_to_indices) the Float32
      and Float64 data types are not supported because the implementation
      relies on the OrdArray trait. This trait is not implemented because
      f64/f32 only implements PartialOrd. The sort function for a single
      column (sort_to_indices) has some special logic which looks like it
      wants to treats NaN the same as null, but I'm also not convinced this
      is the correct way. For example postgres does the following
      (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT)

      "In order to allow floating-point values to be sorted and used in
      tree-based indexes, PostgreSQL treats NaN values as equal, and greater
      than all non-NaN values."

      I propose to do the same in an OrdArray impl for
      Float64Array/Float32Array and then simplifying the sort_to_indices
      function accordingly.

      2. Sorting for dictionary encoded strings. The problem here is that
      DictionaryArray does not have a generic parameter for the value type
      so it is not currently possible to only implement OrdArray for string
      dictionaries. Again for the single column case, the value data type
      could be checked and a sort could be implemented by looking up each
      key in the dictionary. An optimization could be to check the is_sorted
      flag of DictionaryArray (which does not seem to be used really) and
      then directly sort by the keys. For the general case I see roughly to
      options

      • Somehow implement an OrdArray view of the dictionary array. This
        could be easier if OrdArray did not extend Array but was a completely
        separate trait.
      • Change the lexicographic sort impl to not use dynamic calls but
        instead sort multiple times. So for a query `ORDER BY a, b`, first
        sort by b and afterwards sort again by a. With a stable sort
        implementation this should result in the same ordering. I'm curious
        about the performance, it could avoid dynamic method calls for each
        comparison, but it would process the indices vector multiple times.

      My plan is to open a draft PR with the following changes:

      • sort_to_indices further splits up float64/float32 inputs into nulls/non-nan/nan, sorts the non-nan values and then concats those 3 slices according to the sort options. Nans are distinct from null and sort greater than any other valid value
      • implement a sort method for dictionary arrays with string values. this kernel checks the is_ordered flag and sorts just by the keys if it is set, it will look up the string values otherwise
      • for the lexical sort use case the above kernel are not used, instead the OrdArray trait is used. To make that more flexible and allow wrapping arrays with differend ordering behavior I will make it no longer extend Array and instead only contain the cmp_value method
      • string dictionary sorting can then be implemented with a wrapper struct StringDictionaryArrayAsOrdArray which implements OrdArray
      • NaN aware sorting of floats can also be implemented with a wrapper struct and trait implementation

      Attachments

        Issue Links

          Activity

            People

              jhorstmann Jörn Horstmann
              jhorstmann Jörn Horstmann
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 20m
                  2h 20m