Apache Arrow / ARROW-8244

[Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields


Details

    Description

      Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask used the `write_to_dataset` API to write partitioned Parquet datasets. That PR switches to a (hopefully temporary) custom solution, because the API makes it difficult to populate the "file_path" column-chunk metadata fields that are returned within the optional `metadata_collector` kwarg. Dask needs these fields set correctly in order to generate a proper global `"_metadata"` file.

      Possible solutions to this problem:

      1. Optionally populate the file-path fields within `write_to_dataset`
      2. Always populate the file-path fields within `write_to_dataset`
      3. Return the file paths for the data written within `write_to_dataset` (up to the user to manually populate the file-path fields)


People

              jorisvandenbossche Joris Van den Bossche
              rjzamora Rick Zamora


Time Tracking

  Estimated: Not Specified
  Remaining: 0h
  Logged: 1h