Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Specifically for Parquet (I am not sure whether this is relevant for other file formats as well; for IPC/Feather it would potentially be the record batch), it would be useful to target row groups instead of files as fragments.
Quoting the original design documents: "In datasets consisting of many fragments, the dataset API must expose the granularity of fragments in a public way to enable parallel processing, if desired."
And a comment from Wes on that: "a single Parquet file can 'export' one or more fragments based on settings. The default might be to split fragments based on row group."
Currently, the level on which fragments are defined (at least in the typical partitioned parquet dataset) is "1 file == 1 fragment".
Would it be possible or desirable to make this more fine grained, where you could also opt to have a fragment per row group?
We could have a ParquetFragment that supports this, plus a ParquetFileFormat-specific option that sets the granularity of a fragment (file vs. row group).
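To make the proposal concrete, here is a rough, pure-Python sketch of what such an option could look like. It is not the Arrow API: `FragmentGranularity`, the `granularity` field, the `fragments()` method and the `discover()` helper are all hypothetical names invented for illustration, and the number of row groups per file is passed in by hand rather than read from the Parquet footer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterator, List, Optional, Tuple


class FragmentGranularity(Enum):
    """Hypothetical option: what a single fragment corresponds to."""
    FILE = "file"
    ROW_GROUP = "row_group"


@dataclass(frozen=True)
class ParquetFragment:
    """Hypothetical fragment: a file, optionally narrowed to some row groups."""
    path: str
    # None means "all row groups in the file" (today's 1 file == 1 fragment).
    row_groups: Optional[Tuple[int, ...]] = None


@dataclass
class ParquetFileFormat:
    """Hypothetical format object carrying the granularity option."""
    granularity: FragmentGranularity = FragmentGranularity.FILE

    def fragments(self, path: str, num_row_groups: int) -> Iterator[ParquetFragment]:
        # Assumption: num_row_groups would really come from the file metadata.
        if self.granularity is FragmentGranularity.FILE:
            yield ParquetFragment(path)
        else:
            for i in range(num_row_groups):
                yield ParquetFragment(path, (i,))


def discover(fmt: ParquetFileFormat, files: Dict[str, int]) -> List[ParquetFragment]:
    """Collect fragments for a mapping of file path -> number of row groups."""
    return [frag for path, n in files.items() for frag in fmt.fragments(path, n)]


# Made-up dataset layout: two files with 3 and 2 row groups.
files = {"part-0.parquet": 3, "part-1.parquet": 2}

per_file = discover(ParquetFileFormat(), files)
per_row_group = discover(ParquetFileFormat(FragmentGranularity.ROW_GROUP), files)

print(len(per_file), len(per_row_group))
```

With file granularity each file yields exactly one fragment; with row-group granularity each file yields one fragment per row group, which gives a scheduler smaller units to parallelize over.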