Apache Arrow / ARROW-8244

[Python][Parquet] Add `write_to_dataset` option to populate the "file_path" metadata fields


Details

    Description

      Prior to [dask#6023|https://github.com/dask/dask/pull/6023], Dask used the `write_to_dataset` API to write partitioned Parquet datasets. That PR switches to a (hopefully temporary) custom solution, because the API makes it difficult to populate the "file_path" column-chunk metadata fields that are returned within the optional `metadata_collector` kwarg. Dask needs these fields set correctly in order to generate a proper global `"_metadata"` file.

      Possible solutions to this problem:

      1. Optionally populate the file-path fields within `write_to_dataset`
      2. Always populate the file-path fields within `write_to_dataset`
      3. Return the file paths for the data written within `write_to_dataset` (up to the user to manually populate the file-path fields)


People

              jorisvandenbossche Joris Van den Bossche
              rjzamora Rick Zamora


Time Tracking

  Estimated: Not Specified
  Remaining: 0h
  Logged: 1h