[ARROW-8061] [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.17.0
Component/s: C++
Labels:
- dataset
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/24274

Description

Specifically for Parquet, it would be useful to target row groups instead of files as fragments. (I am not sure whether this is relevant for other file formats as well; for IPC/Feather the equivalent unit would potentially be the record batch.)

Quoting the original design documents: "In datasets consisting of many fragments, the dataset API must expose the granularity of fragments in a public way to enable parallel processing, if desired."
And a comment from Wes on that: "a single Parquet file can "export" one or more fragments based on settings. The default might be to split fragments based on row group".

Currently, the level on which fragments are defined (at least in the typical partitioned parquet dataset) is "1 file == 1 fragment".

Would it be possible or desirable to make this finer-grained, so that you could also opt to have one fragment per row group?
We could have a ParquetFragment that supports this, plus a ParquetFileFormat-specific option that sets the granularity of a fragment (file vs. row group).
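
To make the proposal concrete, here is a minimal sketch of what a granularity option could look like. It deliberately does not use Arrow's classes: `FragmentGranularity`, `ParquetFileInfo`, `Fragment`, and `MakeFragments` are hypothetical names invented for illustration, and the row group count is assumed to be already known (in practice it would come from the Parquet footer).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; NOT Arrow's actual dataset API.
enum class FragmentGranularity { kFile, kRowGroup };

struct ParquetFileInfo {
  std::string path;
  int num_row_groups;  // assumed to have been read from the file footer
};

struct Fragment {
  std::string path;
  std::vector<int> row_groups;  // indices of the row groups this fragment covers
};

// Expand a list of files into fragments at the requested granularity.
std::vector<Fragment> MakeFragments(const std::vector<ParquetFileInfo>& files,
                                    FragmentGranularity granularity) {
  std::vector<Fragment> fragments;
  for (const auto& file : files) {
    assert(file.num_row_groups >= 0);
    if (granularity == FragmentGranularity::kFile) {
      // Current behaviour described in the issue: 1 file == 1 fragment.
      Fragment whole{file.path, {}};
      for (int i = 0; i < file.num_row_groups; ++i) {
        whole.row_groups.push_back(i);
      }
      fragments.push_back(std::move(whole));
    } else {
      // Proposed option: 1 row group == 1 fragment.
      for (int i = 0; i < file.num_row_groups; ++i) {
        fragments.push_back(Fragment{file.path, {i}});
      }
    }
  }
  return fragments;
}
```

With two files holding 3 and 2 row groups, file granularity gives 2 fragments and row group granularity gives 5, so a scanner could schedule up to 5 independent units of work instead of 2.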

cc fsaintjacques bkietz

Issue Links

links to

GitHub Pull Request #6670

People

Assignee: Ben Kietzman

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 10/Mar/20 18:00

Updated: 11/Jan/23 07:57

Resolved: 27/Mar/20 16:52

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 5h 50m