[ARROW-10100] [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/18298

Description

Following the discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask uses the dataset API in its parquet reader), it might be useful to be able to "subset" a ParquetFileFragment, or read only part of it, for a specific set of row group ids.

Use cases:

1. Read only a given set of row group ids (similar to ParquetFile.read_row_groups), e.g. because you want to control the size of the resulting table by reading subsets of row groups.
2. Get a ParquetFileFragment with a subset of row groups (e.g. based on a filter), for example to then get the statistics of only those row groups.

The first case could, for example, be solved by adding a row_groups keyword to ParquetFileFragment.to_table (but this would then be a keyword specific to the parquet format, and we should probably also add it to scan et al.).

The second case is something you can in principle do manually by recreating a fragment with fragment.format.make_fragment(fragment.path, ..., row_groups=[...]). However, this is a) a bit cumbersome, and b) the statistics might need to be parsed again?
The statistics of a set of filtered row groups could also be obtained with split_by_row_group(filter) (and then getting the statistics of each of the resulting fragments), but if you then want a single fragment, you need to recreate one from the obtained row group ids.

So one idea I have now (mostly brainstorming here): would it be useful to have a method to create a "subsetted" ParquetFileFragment, based either on a list of row group ids (fragment.subset(row_groups=[...])) or on a filter (fragment.subset(filter=...), which would be equivalent to split_by_row_group plus recombining into a single fragment)?

cc bkietz rjzamora

Issue Links

links to

GitHub Pull Request #8301

People

Assignee: Joris Van den Bossche

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 25/Sep/20 19:51

Updated: 11/Jan/23 08:11

Resolved: 10/Oct/20 19:29

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1.5h