Apache Arrow / ARROW-8061

[C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)


Details

    Description

      Specifically for Parquet (not sure whether this is relevant for other file formats as well; for IPC/Feather potentially the record batch), it would be useful to target row groups instead of files as fragments.

      Quoting the original design documents: "In datasets consisting of many fragments, the dataset API must expose the granularity of fragments in a public way to enable parallel processing, if desired."
      And a comment from Wes on that: "a single Parquet file can "export" one or more fragments based on settings. The default might be to split fragments based on row group"

      Currently, the level at which fragments are defined (at least in the typical partitioned Parquet dataset) is "1 file == 1 fragment".

      Would it be possible or desirable to make this finer-grained, so that you could also opt to have one fragment per row group?
      We could have a ParquetFragment with this option, and a ParquetFileFormat-specific option that sets the granularity of a fragment (file vs. row group).
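
      To make the two granularities concrete, here is a minimal sketch in plain Python. It is not the Arrow C++ or pyarrow API: `Granularity`, `FileInfo`, `Fragment` and `make_fragments` are hypothetical names used only to illustrate the proposal, and a file is modelled as just a path plus a row group count.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Iterator, Optional


class Granularity(Enum):
    """Hypothetical format-level option: what one fragment corresponds to."""
    FILE = "file"
    ROW_GROUP = "row_group"


@dataclass(frozen=True)
class FileInfo:
    """Stand-in for a Parquet file: its path and how many row groups it has."""
    path: str
    num_row_groups: int


@dataclass(frozen=True)
class Fragment:
    """Hypothetical unit of scanning; row_group is None for a whole-file fragment."""
    path: str
    row_group: Optional[int] = None


def make_fragments(files: Iterable[FileInfo],
                   granularity: Granularity) -> Iterator[Fragment]:
    """Yield one fragment per file, or one fragment per row group of each file."""
    for f in files:
        if granularity is Granularity.FILE:
            yield Fragment(f.path)
        else:
            for i in range(f.num_row_groups):
                yield Fragment(f.path, i)


files = [FileInfo("part-0.parquet", 3), FileInfo("part-1.parquet", 2)]
per_file = list(make_fragments(files, Granularity.FILE))
per_row_group = list(make_fragments(files, Granularity.ROW_GROUP))
```

      With the two example files above, the file-level setting gives the current "1 file == 1 fragment" behaviour, while the row-group setting exposes each row group as its own unit that could be scanned in parallel.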

      cc fsaintjacques bkietz

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 50m