Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
Description
From discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in their parquet reader), it might be useful to somehow "subset" or read a subset of a ParquetFileFragment for a specific set of row group ids.
Use cases:
- Read only a set of row groups ids (this is similar as ParquetFile.read_row_groups), eg because you want to control the size of the resulting table by reading subsets of row groups
- Get a ParquetFileFragment with a subset of row groups (eg based on a filter) to then eg get the statistics of only those row groups
The first case could for example be solved by adding a row_groups keyword to ParquetFileFragment.to_table (but, this is then a keyword specific to the parquet format, and we should then probably also add it to scan et al).
The second case is something you can in principle do yourself manually by recreating a fragment with fragment.format.make_fragment(fragment.path, ..., row_groups=[...]). However, this is a) a bit cumbersome and b) statistics might need to be parsed again?
The statistics of a set of filtered row groups could also be obtained by using split_by_row_group(filter) (and then get the statistics of each of the fragments), but if you then want a single fragment, you need to recreate a fragment with the obtained row group ids.
So one idea I have now (but mostly brainstorming here). Would it be useful to have a method to create a "subsetted" ParquetFileFragment, either based on a list of row group ids (fragment.subset(row_groups=[...]) or either based on a filter (fragment.subset(filter=...), which would be equivalent as split_by_row_group+recombining into a single fragment) ?
Attachments
Issue Links
- links to