[ARROW-10695] [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Won't Do
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels:
- dataset
- dataset-parquet-write

External issue URL:
https://github.com/apache/arrow/issues/26645

Description

Currently, the user can specify a basename_template, which can include an "{i}" placeholder that is replaced with an automatically incremented integer, so that each file written to a single partition gets a unique name:

https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717

It might be useful to also support a UUID, to ensure that file names are unique in general (across writes, not only within a single write) and to mimic the behaviour of the old write_to_dataset implementation.
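Until something like this exists, uniqueness across writes can be approximated on the caller's side by generating one UUID per write call and embedding it in the template. The following is only a sketch under that assumption; the helper name make_per_write_template and its parameters are hypothetical and not part of pyarrow:

```python
import uuid


def make_per_write_template(prefix: str = "part", extension: str = "parquet") -> str:
    """Hypothetical helper: build a basename_template that differs per write call."""
    write_id = uuid.uuid4().hex  # one random identifier for the whole write call
    # Doubled braces keep a literal "{i}" in the result for the writer to fill in.
    return f"{prefix}-{write_id}-{{i}}.{extension}"
```

The returned string still contains the "{i}" placeholder, so it could presumably be passed as basename_template to the dataset writer. Note that this yields one UUID per write rather than one per file, which is weaker than what this issue asks for.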

For example, we could look for a "{uuid}" placeholder in the template string and, if present, replace it with a new UUID for each file.
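The proposed substitution could look roughly like the sketch below. It is written in plain Python for illustration only: the function expand_basename_template is hypothetical, and the actual change would live in the C++ dataset writer, which was never implemented (the issue was closed as Won't Do).

```python
import uuid


def expand_basename_template(template: str, index: int) -> str:
    """Hypothetical helper: fill "{i}" and "{uuid}" placeholders for one file."""
    if "{i}" not in template and "{uuid}" not in template:
        raise ValueError('template must contain "{i}" or "{uuid}"')
    # Existing behaviour: "{i}" becomes the per-write file counter.
    name = template.replace("{i}", str(index))
    # Proposed behaviour: "{uuid}" becomes a fresh random UUID for each file.
    if "{uuid}" in name:
        name = name.replace("{uuid}", uuid.uuid4().hex)
    return name
```

With this sketch, "part-{i}.parquet" and index 3 gives "part-3.parquet", while "part-{uuid}.parquet" gives a different name on every call.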

Issue Links

is duplicated by

ARROW-14010 [C++][Python] No way to generate UUID filenames with new datasets API

Closed

relates to

ARROW-12358 [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

Open

ARROW-12365 [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

Closed

People

Assignee: Unassigned

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 5

Dates

Created: 23/Nov/20 10:39

Updated: 11/Jan/23 08:14

Resolved: 23/Jun/21 16:06