[ARROW-3408] [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.16.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/19735

Description

For many datasets, dictionary encoding all string and binary columns can drastically reduce memory usage and, as a result, improve analytics performance.

One difficulty of dictionary encoding in multithreaded conversions is that ideally you end up with one dictionary at the end. So you have two options:

1. Implement a concurrent hashing scheme. For low-cardinality dictionaries the overhead from mutex contention will not be meaningful; for high-cardinality dictionaries it can be more of a problem.

2. Hash each chunk separately, then normalize at the end.
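
To make the second option concrete, here is a minimal standalone sketch using only the C++ standard library. It is not Arrow code: `EncodedChunk`, `EncodeChunk` and `UnifyDictionaries` are hypothetical names invented for illustration, and plain `std::string` vectors stand in for Arrow arrays. Each chunk is encoded with its own private hash table (so no locking is needed), and a final single-threaded pass merges the per-chunk dictionaries and rewrites the indices.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary-encoded chunk: indices[i] points into dictionary.
struct EncodedChunk {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Step 1: encode one chunk with a private hash table. Nothing is shared,
// so each chunk could be processed on a separate thread without locks.
EncodedChunk EncodeChunk(const std::vector<std::string>& values) {
  EncodedChunk out;
  std::unordered_map<std::string, int32_t> index;
  out.indices.reserve(values.size());
  for (const auto& v : values) {
    auto result = index.emplace(v, static_cast<int32_t>(out.dictionary.size()));
    if (result.second) out.dictionary.push_back(v);  // first time we see v
    out.indices.push_back(result.first->second);
  }
  return out;
}

// Step 2: build one unified dictionary and rewrite every chunk's indices
// so that they all refer to it.
std::vector<std::string> UnifyDictionaries(std::vector<EncodedChunk>* chunks) {
  std::vector<std::string> unified;
  std::unordered_map<std::string, int32_t> index;
  for (auto& chunk : *chunks) {
    // remap[local index] == position in the unified dictionary
    std::vector<int32_t> remap(chunk.dictionary.size());
    for (std::size_t i = 0; i < chunk.dictionary.size(); ++i) {
      auto result = index.emplace(chunk.dictionary[i],
                                  static_cast<int32_t>(unified.size()));
      if (result.second) unified.push_back(chunk.dictionary[i]);
      remap[i] = result.first->second;
    }
    for (auto& idx : chunk.indices) idx = remap[idx];
    chunk.dictionary.clear();  // superseded by the unified dictionary
  }
  return unified;
}
```

The cost of this approach is the extra normalization pass: every index in every chunk is touched a second time, and values that appear in several chunks are hashed once per chunk and once more during unification.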

My guess is that a crude concurrent hash table, with a mutex to protect mutations and resizes, will outperform hashing each chunk separately and normalizing at the end.
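
For comparison, a crude mutex-protected dictionary of the kind guessed at here might look like the sketch below. Again this uses only the C++ standard library; `SharedDictionary` and `EncodeChunks` are hypothetical names, not Arrow APIs, and a single coarse lock guards every lookup and insert. It is meant to illustrate the idea, not to describe what the eventual implementation does.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical shared dictionary: one mutex protects lookups, inserts and
// any rehash/resize of the underlying containers.
class SharedDictionary {
 public:
  // Returns the index of `value`, inserting it if it has not been seen yet.
  int32_t GetOrInsert(const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(value);
    if (it != index_.end()) return it->second;
    const int32_t id = static_cast<int32_t>(values_.size());
    values_.push_back(value);
    index_.emplace(value, id);
    return id;
  }

  // Only safe to call after all worker threads have finished.
  const std::vector<std::string>& values() const { return values_; }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, int32_t> index_;
  std::vector<std::string> values_;
};

// Encodes each chunk on its own thread against the shared dictionary, so
// all chunks end up with indices into the same single dictionary.
std::vector<std::vector<int32_t>> EncodeChunks(
    const std::vector<std::vector<std::string>>& chunks,
    SharedDictionary* dict) {
  std::vector<std::vector<int32_t>> indices(chunks.size());
  std::vector<std::thread> workers;
  for (std::size_t c = 0; c < chunks.size(); ++c) {
    workers.emplace_back([&, c] {
      indices[c].reserve(chunks[c].size());
      for (const auto& v : chunks[c]) {
        indices[c].push_back(dict->GetOrInsert(v));
      }
    });
  }
  for (auto& w : workers) w.join();
  return indices;
}
```

Because threads race to insert new values, the order of entries in the dictionary is not deterministic across runs; only the mapping from index to value is guaranteed to be consistent. Every value takes the lock, including ones already in the dictionary, which is where contention would show up for high-cardinality columns.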

Issue Links

depends upon

ARROW-5052 [C++] Add an incomplete dictionary type

Closed

links to

GitHub Pull Request #5785

People

Assignee: Antoine Pitrou

Reporter: Wes McKinney

Votes: 0

Watchers: 3

Dates

Created: 02/Oct/18 17:48

Updated: 11/Jan/23 07:27

Resolved: 07/Nov/19 16:49

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3h 40m