[ARROW-8647] [C++][Dataset] Optionally encode partition field values as dictionary type - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.0.0
Component/s: C++
Labels:

External issue URL: https://github.com/apache/arrow/issues/24808

Description

In the Python ParquetDataset implementation, the partition fields are returned as dictionary type columns.

The new Dataset API instead uses a plain type (integer or string, when inferred). However, you can already request dictionary-typed partition keys manually by specifying the partitioning schema (in the Partitioning passed to the dataset factory).

Because partition keys are typically repeated values in the resulting table, dictionary type can be more efficient. It might therefore be good to still have an option in the DatasetFactory to use dictionary types for the partition fields.
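To illustrate why repeated partition keys favor a dictionary representation, here is a minimal pure-Python sketch of dictionary encoding. It is not Arrow's implementation: the helper names `dictionary_encode` and `dictionary_decode` and the sample column are hypothetical and exist only for this illustration.

```python
def dictionary_encode(values):
    """Split a column into (dictionary, indices).

    `dictionary` holds each distinct value once, in first-seen order;
    `indices` holds one small integer per row pointing into it.
    """
    dictionary = []
    positions = {}  # value -> its index in `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the plain column from its dictionary-encoded form."""
    return [dictionary[i] for i in indices]


# Hypothetical partition column: two partition directories with
# three rows each, so the key repeats for every row of a partition.
partition_column = ["2019", "2019", "2019", "2020", "2020", "2020"]

dictionary, indices = dictionary_encode(partition_column)
print(dictionary)  # ['2019', '2020']
print(indices)     # [0, 0, 0, 1, 1, 1]
```

Each distinct key is stored once and every row carries only an integer index, so the saving grows with the number of rows per partition.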

See also https://github.com/apache/arrow/pull/6303#discussion_r400622340

Issue Links

relates to: ARROW-9288 [C++][Dataset] Discovery of partition field as dictionary type segfaulting with HivePartitioning (Resolved)

links to: GitHub Pull Request #7536

People

Assignee: Ben Kietzman

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 4

Dates

Created: 30/Apr/20 13:35

Updated: 11/Jan/23 08:01

Resolved: 30/Jun/20 13:55

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 2h 10m