[ARROW-8658] [C++][Dataset] Implement subtree pruning for FileSystemDataset::GetFragments - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.17.0
Fix Version/s: 4.0.0
Component/s: C++
Labels:
- dataset
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/24819

Description

Subtree pruning is a valuable optimization for large datasets with multiple partition fields. For example, given a hive-style directory $base_dir/a=3/ and a filter "a"_ == 2, none of the files or subdirectories under that directory need to be examined.

After ARROW-8318, FileSystemDataset stores only files, so subtree pruning (whose implementation depended on the presence of directories to represent subtrees) was disabled. It should be possible to reintroduce pruning without reference to directories by examining the partition expressions directly and extracting a tree structure from their subexpressions.
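
The idea in the description can be illustrated with a minimal, Arrow-free sketch. It is not the implementation from the linked pull request. It assumes each fragment's partition expression is an ordered conjunction of integer equality guarantees and that the filter is a single equality. All names here (Guarantee, PartitionKey, Node, BuildTree, Collect, GetFragments, PruneResult) are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one subexpression of a partition expression,
// e.g. {"a", 3} means the guarantee a == 3.
using Guarantee = std::pair<std::string, int>;
// A fragment's partition expression as an ordered conjunction of guarantees.
using PartitionKey = std::vector<Guarantee>;

// Prefix-tree node: fragments sharing a leading guarantee share a subtree.
struct Node {
  std::map<Guarantee, std::unique_ptr<Node>> children;
  std::vector<std::size_t> fragments;  // indices of fragments ending here
};

// Extract a tree from the partition keys alone; no directories involved.
Node BuildTree(const std::vector<PartitionKey>& keys) {
  Node root;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    Node* node = &root;
    for (const Guarantee& g : keys[i]) {
      std::unique_ptr<Node>& child = node->children[g];
      if (!child) child = std::make_unique<Node>();
      node = child.get();
    }
    node->fragments.push_back(i);
  }
  return root;
}

struct PruneResult {
  std::vector<std::size_t> fragments;  // fragments that may match the filter
  std::size_t nodes_visited = 0;       // tree nodes actually examined
};

// Depth-first walk that skips any subtree whose guarantee contradicts the
// equality filter (same field, different value).
void Collect(const Node& node, const Guarantee& filter, PruneResult* out) {
  assert(out != nullptr);
  ++out->nodes_visited;
  out->fragments.insert(out->fragments.end(), node.fragments.begin(),
                        node.fragments.end());
  for (const auto& entry : node.children) {
    const Guarantee& g = entry.first;
    if (g.first == filter.first && g.second != filter.second) continue;
    Collect(*entry.second, filter, out);
  }
}

// For brevity the tree is rebuilt per call; a real dataset would likely
// build it once and reuse it across scans.
PruneResult GetFragments(const std::vector<PartitionKey>& keys,
                         const Guarantee& filter) {
  const Node root = BuildTree(keys);
  PruneResult result;
  Collect(root, filter, &result);
  return result;
}
```

With four fragments keyed by (a, b) in {2, 3} x {1, 2}, a filter on a == 2 rejects the whole a == 3 subtree after a single comparison, rather than testing each fragment beneath it. A filter on a field that appears in no partition key prunes nothing and returns every fragment.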

Issue Links

is related to

ARROW-11781 [Python] Reading small amount of files from a partitioned dataset is unexpectedly slow

Resolved

links to

GitHub Pull Request #9670

People

Assignee: Ben Kietzman

Reporter: Ben Kietzman

Votes: 0

Watchers: 4

Dates

Created: 30/Apr/20 19:02

Updated: 11/Jan/23 08:01

Resolved: 12/Mar/21 15:57

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 5h 10m