[ARROW-6933] [Java] Support linear dictionary encoder - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.16.0
Component/s: Java
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/23254

Description

In many scenarios, the distribution of dictionary entries is highly skewed: a few dictionary entries occur much more frequently than the others. If we sort the dictionary in non-increasing order of entry frequency, and compare each value to encode against the entries starting from the beginning of the dictionary, we get the following benefits:

1) We need no extra memory space or data structure.
2) The search is extremely efficient, as we are likely to find a match in the first few entries of the dictionary.

This is the basic idea behind the linear dictionary encoder. When the scenario is right (a highly skewed distribution of dictionary entries), it outperforms both the search-based and the hash-table-based encoders.
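
To make the idea concrete, here is a minimal sketch in plain Java. It is not the code from the pull request and does not use Arrow vectors: the class name `LinearDictionaryEncoderSketch` and the methods `buildDictionary` and `encode` are hypothetical, and values are assumed to be non-null strings. The dictionary is sorted by non-increasing frequency, and each value is encoded by a linear scan from the front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the Arrow API.
final class LinearDictionaryEncoderSketch {

    /** Builds a dictionary of distinct values ordered by non-increasing frequency. */
    static List<String> buildDictionary(List<String> sample) {
        Map<String, Integer> counts = new HashMap<>();
        for (String value : sample) {
            counts.merge(value, 1, Integer::sum);
        }
        List<String> dictionary = new ArrayList<>(counts.keySet());
        // Most frequent first; ties are broken alphabetically so the order is deterministic.
        dictionary.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return dictionary;
    }

    /** Encodes each value as its dictionary index, scanning the dictionary from the front. */
    static int[] encode(List<String> values, List<String> dictionary) {
        int[] indices = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            indices[i] = indexOf(values.get(i), dictionary);
        }
        return indices;
    }

    // Plain linear search: no hash table or other auxiliary structure is needed.
    private static int indexOf(String value, List<String> dictionary) {
        for (int j = 0; j < dictionary.size(); j++) {
            if (dictionary.get(j).equals(value)) {
                return j;
            }
        }
        throw new IllegalArgumentException("Value not in dictionary: " + value);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "a", "a", "c", "a", "b", "a");
        List<String> dictionary = buildDictionary(values);
        System.out.println(dictionary);                                  // [a, b, c]
        System.out.println(Arrays.toString(encode(values, dictionary))); // [0, 1, 0, 0, 2, 0, 1, 0]
    }
}
```

In this example, five of the eight values match the first dictionary entry, so most lookups finish after a single comparison. With a flat distribution the scan degrades to O(n) comparisons per value for a dictionary of n entries, which is why the approach only pays off when the distribution is skewed.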

Issue Links

links to

GitHub Pull Request #5692

People

Assignee: Liya Fan

Reporter: Liya Fan

Votes: 0

Watchers: 3

Dates

Created: 18/Oct/19 11:46

Updated: 11/Jan/23 07:50

Resolved: 24/Oct/19 05:57

Time Tracking

Estimated: Not Specified

Remaining:

Logged: