[ARROW-5917] [Java] Redesign the dictionary encoder - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: Java
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/22327

Description

The current dictionary encoder implementation (org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance overhead, which prevents it from being useful in practice:

- It repeatedly converts between Java objects and bytes (e.g. through vector.getObject).
- It copies memory unnecessarily (the vector data must be copied to the hash table).
- The hash table cannot be reused for encoding multiple vectors (nor can the other data structures and results).
- The output vector should not be created or managed by the encoder (just as in the out-of-place sorter).
- The hash table requires that the hashCode and equals methods be implemented appropriately, but this is not guaranteed.

We plan to implement a new one in the algorithm module, and gradually deprecate the current one.
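
A minimal sketch of the direction described above, assuming plain int arrays stand in for Arrow vectors. The class ReusableDictionaryEncoder and its methods are hypothetical names chosen for illustration; they are not the existing Arrow API, nor necessarily the design that will be implemented. The sketch builds the hash table once per dictionary, compares primitive values directly instead of relying on hashCode/equals, and writes indices into an output array supplied by the caller.

```java
import java.util.Arrays;

/** Hypothetical illustration only; int[] stands in for an Arrow vector. */
final class ReusableDictionaryEncoder {
    private final int[] dictionary; // dictionary values
    private final int[] slots;      // open-addressing table: dictionary index, or -1 if empty
    private final int mask;

    ReusableDictionaryEncoder(int[] dictionary) {
        this.dictionary = dictionary;
        // Power-of-two capacity, at most half full, so probing always terminates.
        int capacity = 2;
        while (capacity < dictionary.length * 2) {
            capacity <<= 1;
        }
        this.mask = capacity - 1;
        this.slots = new int[capacity];
        Arrays.fill(slots, -1);
        for (int i = 0; i < dictionary.length; i++) {
            int slot = hash(dictionary[i]);
            // Linear probing; a duplicate value keeps its first index.
            while (slots[slot] != -1 && dictionary[slots[slot]] != dictionary[i]) {
                slot = (slot + 1) & mask;
            }
            if (slots[slot] == -1) {
                slots[slot] = i;
            }
        }
    }

    // Hashes the primitive value directly; no boxing, no hashCode/equals.
    private int hash(int value) {
        int h = value * 0x9E3779B9;
        return (h ^ (h >>> 16)) & mask;
    }

    private int lookup(int value) {
        int slot = hash(value);
        while (slots[slot] != -1) {
            if (dictionary[slots[slot]] == value) {
                return slots[slot];
            }
            slot = (slot + 1) & mask;
        }
        return -1;
    }

    /** Writes dictionary indices into an output array owned by the caller. */
    void encode(int[] input, int[] output) {
        if (output.length < input.length) {
            throw new IllegalArgumentException("output is too small");
        }
        for (int i = 0; i < input.length; i++) {
            int index = lookup(input[i]);
            if (index < 0) {
                throw new IllegalArgumentException("value not in dictionary: " + input[i]);
            }
            output[i] = index;
        }
    }

    public static void main(String[] args) {
        ReusableDictionaryEncoder encoder = new ReusableDictionaryEncoder(new int[] {10, 20, 30});
        int[] first = new int[4];
        int[] second = new int[2];
        encoder.encode(new int[] {20, 10, 30, 20}, first);  // same hash table ...
        encoder.encode(new int[] {30, 30}, second);         // ... reused for a second input
        System.out.println(Arrays.toString(first));   // prints [1, 0, 2, 1]
        System.out.println(Arrays.toString(second));  // prints [2, 2]
    }
}
```

In this sketch the encoder never allocates the result, so the caller decides how the output is created and released, and one encoder instance serves any number of inputs that share the dictionary.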

Issue Links

relates to: ARROW-6184 [Java] Provide hash table based dictionary encoder (Resolved)
links to: GitHub Pull Request #4994

People

Assignee: Liya Fan
Reporter: Liya Fan
Votes: 1
Watchers: 4

Dates

Created: 12/Jul/19 08:25
Updated: 11/Jan/23 07:43
Resolved: 18/Sep/19 04:45

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 12h 20m