[ARROW-3772] [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: C++
Labels:
- parquet
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/20110

Description

Dictionary-encoded data is very common in Parquet. In the current implementation, parquet-cpp always decodes dictionary-encoded data before creating a plain Arrow array. This is wasteful, since we could use Arrow's DictionaryArray directly and gain several benefits:

- Smaller memory footprint, both during decoding and in the resulting Arrow table, especially when the dictionary values are large.
- Better decoding performance, mostly as a result of the first point: fewer memory fetches and fewer allocations.
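
To make the footprint argument concrete, here is a rough back-of-the-envelope sketch. The function names and the example numbers are illustrative assumptions, not part of parquet-cpp or Arrow, and the estimate ignores offsets and validity buffers.

```cpp
#include <cassert>
#include <cstdint>

// Approximate bytes for a plain (dense) array: every row stores a full value.
std::uint64_t DenseBytes(std::uint64_t num_rows, std::uint64_t value_bytes) {
  return num_rows * value_bytes;
}

// Approximate bytes for a dictionary array: one index per row plus a single
// copy of each distinct value.
std::uint64_t DictionaryBytes(std::uint64_t num_rows, std::uint64_t num_distinct,
                              std::uint64_t value_bytes,
                              std::uint64_t index_bytes) {
  assert(num_distinct <= num_rows);  // cannot have more distinct values than rows
  return num_rows * index_bytes + num_distinct * value_bytes;
}
```

Under these assumptions, 1,000,000 rows with 100 distinct 64-byte values and 4-byte indices take about 64,000,000 bytes dense versus 4,006,400 bytes as a dictionary.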

I think these benefits could yield significant runtime improvements.

My direction for the implementation is to read the indices (through the DictionaryDecoder, after the RLE decoding) and the values separately into two arrays, and create a DictionaryArray from them.
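
The difference between the two approaches can be sketched with standard-library types only. `DictionaryColumn`, `DecodeDense`, and `DecodeToDictionary` are hypothetical names used for illustration; they are not the parquet-cpp decoder or the real arrow::DictionaryArray.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary array: indices into a table of
// distinct values.
struct DictionaryColumn {
  std::vector<std::int32_t> indices;
  std::vector<std::string> dictionary;
};

// Current behaviour: every index is expanded into a full copy of its value,
// producing a plain (dense) array.
std::vector<std::string> DecodeDense(const std::vector<std::string>& dictionary,
                                     const std::vector<std::int32_t>& indices) {
  std::vector<std::string> out;
  out.reserve(indices.size());
  for (std::int32_t idx : indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < dictionary.size());
    out.push_back(dictionary[static_cast<std::size_t>(idx)]);
  }
  return out;
}

// Proposed direction: keep the decoded indices and the dictionary values as
// two separate arrays and pair them up without expanding anything.
DictionaryColumn DecodeToDictionary(std::vector<std::string> dictionary,
                                    std::vector<std::int32_t> indices) {
  for (std::int32_t idx : indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < dictionary.size());
  }
  return DictionaryColumn{std::move(indices), std::move(dictionary)};
}
```

In this sketch the dictionary path copies no value data per row, which is where the fewer allocations would come from.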

There are some questions to discuss:

- Should this be the default behavior for dictionary-encoded data?
- Should it be controlled by a parameter in the API?
- What should the policy be when some of the chunks are dictionary encoded and some are not?
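
For the last question, one possible policy (an assumption for discussion, not necessarily what was adopted) is to dictionary-encode any plain chunk on the fly, so that every chunk yields the same dictionary-typed output. A minimal sketch with hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dictionary array (not arrow::DictionaryArray).
struct DictionaryColumn {
  std::vector<std::int32_t> indices;
  std::vector<std::string> dictionary;
};

// Builds a dictionary from plain-encoded values, assigning indices in order
// of first appearance.
DictionaryColumn EncodePlainChunk(const std::vector<std::string>& values) {
  DictionaryColumn out;
  std::unordered_map<std::string, std::int32_t> memo;  // value -> index
  out.indices.reserve(values.size());
  for (const std::string& v : values) {
    auto it = memo.find(v);
    if (it == memo.end()) {
      // First time this value is seen: append it to the dictionary.
      assert(out.dictionary.size() <
             static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
      it = memo.emplace(v, static_cast<std::int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

The trade-off is extra hashing work for plain chunks in exchange for a uniform output type across the column.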

I started implementing this but would like to hear your opinions.

Issue Links

is depended upon by

ARROW-3325 [Python] Support reading Parquet binary/string columns directly as DictionaryArray

Resolved

is related to

ARROW-3652 [Python] CategoricalIndex is lost after reading back

Resolved

ARROW-5993 [Python] Reading a dictionary column from Parquet results in disproportionate memory usage

Closed

ARROW-5984 [C++] Provide method on AdaptiveIntBuilder for appending integer Array types

Open

ARROW-3325 [Python] Support reading Parquet binary/string columns directly as DictionaryArray

Resolved

ARROW-6049 [C++] Support using Array::View from compatible dictionary type to another

Resolved

ARROW-3246 [Python][Parquet] direct reading/writing of pandas categoricals in parquet

Resolved

relates to

ARROW-6140 [C++][Parquet] Support direct dictionary decoding of types other than BYTE_ARRAY

Open

links to

GitHub Pull Request #4949

People

Assignee: Wes McKinney
Reporter: Stav Nir
Votes: 0
Watchers: 5

Dates

Created: 13/Jun/18 14:08
Updated: 11/Jan/23 07:29
Resolved: 26/Jul/19 23:54

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 2h 50m