Apache Arrow / ARROW-10100

[C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids

Details

    Description

      From the discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in their parquet reader): it might be useful to be able to "subset" a ParquetFileFragment, or to read only part of it, for a specific set of row group ids.

      Use cases:

      • Read only a given set of row group ids (similar to ParquetFile.read_row_groups), e.g. because you want to control the size of the resulting table by reading subsets of row groups
      • Get a ParquetFileFragment restricted to a subset of row groups (e.g. based on a filter), for example to then get the statistics of only those row groups

      The first case could, for example, be solved by adding a row_groups keyword to ParquetFileFragment.to_table (but this would then be a keyword specific to the Parquet format, and we should then probably also add it to scan et al.).

      The second case is something you can in principle already do manually, by recreating a fragment with fragment.format.make_fragment(fragment.path, ..., row_groups=[...]). However, this is a) a bit cumbersome and b) statistics might need to be parsed again?
      The statistics of a set of filtered row groups could also be obtained with split_by_row_group(filter) (and then getting the statistics of each of the resulting fragments), but if you then want a single fragment, you still need to recreate one from the obtained row group ids.

      So one idea I have now (mostly brainstorming here): would it be useful to have a method to create a "subsetted" ParquetFileFragment, based either on a list of row group ids (fragment.subset(row_groups=[...])) or on a filter (fragment.subset(filter=...), which would be equivalent to split_by_row_group followed by recombining into a single fragment)?

      cc bkietz rjzamora

            People

              jorisvandenbossche Joris Van den Bossche
              jorisvandenbossche Joris Van den Bossche
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1.5h