Apache Arrow / ARROW-5917

[Java] Redesign the dictionary encoder

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Java

      Description

      The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead and design limitations that prevent it from being useful in practice:

      1. It repeatedly converts between Java objects and bytes (e.g. vector.getObject).
      2. It copies memory unnecessarily (the vector data must be copied into the hash table).
      3. The hash table cannot be reused for encoding multiple vectors (other data structures and results cannot be reused either).
      4. The encoder creates and manages the output vector itself; the output vector should instead be supplied by the caller (as in the out-of-place sorter).
      5. The hash table requires that the hashCode and equals methods be implemented appropriately, but this is not guaranteed.

      We plan to implement a new encoder in the algorithm module and gradually deprecate the current one.
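
To make the five points above concrete, here is a minimal sketch of what an encoder with these properties could look like. It is not the Arrow API and not the proposed implementation: the class name RangeDictionaryEncoder and its methods are hypothetical, plain byte[]/int[] arrays stand in for a variable-width vector's data and offset buffers, nulls are ignored, and returning -1 for a value missing from the dictionary is an assumption made for the sketch. The hash table stores only dictionary indices (no copies of the values), hashes and compares raw byte ranges (no Java objects, no reliance on hashCode/equals), is built once and reused, and writes into an output array supplied by the caller.

```java
/** Hypothetical sketch only; plain arrays stand in for Arrow vector buffers. */
final class RangeDictionaryEncoder {
  private final byte[] dictData;   // concatenated dictionary values
  private final int[] dictOffsets; // value i occupies [dictOffsets[i], dictOffsets[i + 1])
  private final int[] slots;       // open-addressing table of dictionary indices, -1 = empty
  private final int mask;

  /** Builds the index table once; it can then be reused for any number of inputs. */
  RangeDictionaryEncoder(byte[] dictData, int[] dictOffsets) {
    this.dictData = dictData;
    this.dictOffsets = dictOffsets;
    int valueCount = dictOffsets.length - 1;
    int capacity = 4;
    while (capacity < valueCount * 2) {
      capacity <<= 1; // power of two, load factor at most 0.5
    }
    slots = new int[capacity];
    java.util.Arrays.fill(slots, -1);
    mask = capacity - 1;
    for (int i = 0; i < valueCount; i++) {
      int slot = hash(dictData, dictOffsets[i], dictOffsets[i + 1]) & mask;
      while (slots[slot] != -1) {
        slot = (slot + 1) & mask; // linear probing
      }
      slots[slot] = i; // store the index only, never a copy of the value
    }
  }

  /** Encodes every value of the input into the caller-supplied output array. */
  void encode(byte[] data, int[] offsets, int[] out) {
    int count = offsets.length - 1;
    if (out.length < count) {
      throw new IllegalArgumentException("output array too small");
    }
    for (int i = 0; i < count; i++) {
      out[i] = indexOf(data, offsets[i], offsets[i + 1]);
    }
  }

  /** Returns the dictionary index of data[start, end), or -1 if absent (sketch assumption). */
  int indexOf(byte[] data, int start, int end) {
    int slot = hash(data, start, end) & mask;
    while (slots[slot] != -1) {
      int d = slots[slot];
      if (rangeEquals(dictData, dictOffsets[d], dictOffsets[d + 1], data, start, end)) {
        return d;
      }
      slot = (slot + 1) & mask;
    }
    return -1;
  }

  /** Hashes the raw bytes directly, so no object is materialized per value. */
  private static int hash(byte[] data, int start, int end) {
    int h = 1;
    for (int i = start; i < end; i++) {
      h = 31 * h + data[i];
    }
    return h ^ (h >>> 16);
  }

  /** Byte-wise comparison of two ranges. */
  private static boolean rangeEquals(byte[] a, int aStart, int aEnd, byte[] b, int bStart, int bEnd) {
    if (aEnd - aStart != bEnd - bStart) {
      return false;
    }
    for (int i = 0; i < aEnd - aStart; i++) {
      if (a[aStart + i] != b[bStart + i]) {
        return false;
      }
    }
    return true;
  }
}
```

With a dictionary holding "foo", "bar" and "baz", one encoder instance built from it can encode several inputs in a row, each into an int[] the caller allocates. An actual implementation would operate on Arrow buffers and would have to decide how to handle nulls and missing values.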

              People

              • Assignee:
                fan_li_ya Liya Fan
                Reporter:
                fan_li_ya Liya Fan
              • Votes:
                1
                Watchers:
                2

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 8h 40m