ARROW-11732

[C++] DictionaryEncode should convert dictionaries from one type of encoding to the other


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      There are two styles of encoding nulls in dictionaries: masked or encoded. In compute::DictionaryEncode this is controlled by an option. Today, if you pass an array that is already dictionary-encoded into DictionaryEncode, it is a no-op.

      Instead it should check to see if the dictionary is properly encoded (this is easily checked in constant time) according to the requested null encoding scheme and, if not, it should convert it.
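To illustrate why the check can run in constant time, here is a minimal sketch. It assumes the null counts of the indices and of the dictionary values are already known (for example, cached on the arrays). `DictionaryNullCounts`, `NullEncoding`, and `IsEncodedAs` are hypothetical names invented for this example and are not part of Arrow's API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for metadata assumed to be cached on a
// dictionary array. These names are illustrative, not Arrow's API.
enum class NullEncoding { kMask, kEncode };

struct DictionaryNullCounts {
  int64_t index_null_count;       // slots masked out by the indices' validity bitmap
  int64_t dictionary_null_count;  // null entries stored among the dictionary values
};

// Returns true when the array already satisfies the requested scheme.
// Only two precomputed counters are inspected, so the check is O(1).
inline bool IsEncodedAs(const DictionaryNullCounts& counts, NullEncoding requested) {
  assert(counts.index_null_count >= 0 && counts.dictionary_null_count >= 0);
  switch (requested) {
    case NullEncoding::kMask:
      // Masked nulls: the dictionary itself must not contain a null entry.
      return counts.dictionary_null_count == 0;
    case NullEncoding::kEncode:
      // Encoded nulls: no index may be masked; nulls live in the dictionary.
      return counts.index_null_count == 0;
  }
  return false;
}
```

Under this model, an array with no nulls at all satisfies both schemes, so it would never need converting.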

      So that this doesn't change existing behavior, either the default NullEncodingBehavior should change to EXISTING_OR_ENCODE or a second option should be added.

      Once this is done, partition.cc could be improved. It currently requires that dictionaries use "encoded nulls"; if a dictionary that uses "masked nulls" is passed in, it decodes and re-encodes the dictionary, which is a potentially costly operation. This could be changed to use the conversion instead.
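As a rough sketch of why a direct conversion could be cheaper than decoding and re-encoding, the function below turns masked nulls into encoded nulls by rewriting only the indices and appending at most one null entry to the dictionary. `SimpleDictionaryArray` and `MaskedToEncoded` are hypothetical, simplified stand-ins written for this example; they are not Arrow classes or functions, and a real implementation would operate on Arrow buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified dictionary array; not Arrow's actual classes.
struct SimpleDictionaryArray {
  std::vector<int32_t> indices;
  std::vector<bool> index_valid;                       // false marks a masked null
  std::vector<std::optional<std::string>> dictionary;  // nullopt is an encoded null
};

// Converts masked nulls to encoded nulls. The dictionary values are never
// hashed or compared beyond locating an existing null entry.
inline SimpleDictionaryArray MaskedToEncoded(SimpleDictionaryArray array) {
  // Reuse a null entry if the dictionary already has one.
  int32_t null_index = -1;
  for (std::size_t i = 0; i < array.dictionary.size(); ++i) {
    if (!array.dictionary[i].has_value()) {
      null_index = static_cast<int32_t>(i);
      break;
    }
  }
  for (std::size_t i = 0; i < array.indices.size(); ++i) {
    if (array.index_valid[i]) continue;
    if (null_index < 0) {
      // First masked slot seen: append a single null entry to the dictionary.
      null_index = static_cast<int32_t>(array.dictionary.size());
      array.dictionary.push_back(std::nullopt);
    }
    array.indices[i] = null_index;  // point the slot at the null entry
    array.index_valid[i] = true;    // the slot is no longer masked
  }
  return array;
}
```

In this model the work is one pass over the dictionary plus one pass over the indices, whereas decoding and re-encoding would have to materialize and re-hash every value.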

          People

            Assignee: Unassigned
            Reporter: Weston Pace (westonpace)
            Votes: 0
            Watchers: 3
