Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently, dataset writing (e.g. with pyarrow.dataset.write_dataset) uses a fixed filename template ("part{i}.ext"). This means that when writing to an existing dataset with this default template, the new files silently replace the previously written ones.
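A minimal sketch of the collision (the directory name "my_dataset" is illustrative, and the exact default template string may vary by pyarrow version):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# First write: the default template produces a file such as part-0.parquet.
ds.write_dataset(pa.table({"a": [1, 2, 3]}), "my_dataset", format="parquet")

# Second write to the same directory: the default template generates the
# same file name again, so the file from the first write is silently
# replaced rather than appended to.
ds.write_dataset(pa.table({"a": [4, 5, 6]}), "my_dataset", format="parquet")
{code}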
There is some discussion in ARROW-10695 about how the user can avoid this by ensuring the file names are unique (specifying a unique basename_template). Conversely, ARROW-7706 covers the legacy parquet.write_to_dataset implementation, which does not overwrite existing data but instead silently doubles it.
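For illustration, a sketch of that workaround using a UUID in the basename_template, as suggested in ARROW-10695 (the exact template shape is an assumption; any unique token works, as long as the template still contains "{i}"):
{code:python}
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

# Embedding a per-write UUID in the template guarantees that a second
# write cannot collide with files produced by an earlier write.
ds.write_dataset(
    pa.table({"a": [4, 5, 6]}),
    "my_dataset",
    format="parquet",
    basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
)
{code}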
It would be good to have a "mode" option when writing datasets that controls these different behaviours. Raising an error when there is pre-existing data in the target directory is probably the safest default, because silently appending and silently overwriting can both be surprising behaviour, depending on the user's expectations.
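To make the proposal concrete, here is a hypothetical sketch of such a mode, written as a wrapper around the current API (write_dataset_with_mode and the mode names "error"/"append"/"overwrite" are invented for illustration, not an existing pyarrow API; a real implementation would live in the dataset writer itself):
{code:python}
import os
import shutil
import uuid

import pyarrow.dataset as ds


def write_dataset_with_mode(data, base_dir, mode="error", **kwargs):
    """Hypothetical wrapper sketching the proposed 'mode' behaviours.

    mode="error"     -> refuse to write if base_dir already contains data
                        (the suggested safest default)
    mode="append"    -> keep existing files, write new ones under unique names
    mode="overwrite" -> remove pre-existing contents before writing
    """
    has_data = os.path.isdir(base_dir) and bool(os.listdir(base_dir))
    if mode == "error" and has_data:
        raise FileExistsError(f"{base_dir!r} already contains data")
    if mode == "overwrite" and has_data:
        shutil.rmtree(base_dir)
    if mode == "append":
        # Reuse the UUID workaround so appended files never collide
        # (".parquet" extension assumed here for simplicity).
        kwargs.setdefault(
            "basename_template", f"part-{uuid.uuid4().hex}-{{i}}.parquet"
        )
    ds.write_dataset(data, base_dir, **kwargs)
{code}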
Issue Links
- depends upon:
  - ARROW-12509 [C++] More fine-grained control of file creation in filesystem layer (In Progress)
- fixes:
  - ARROW-12365 [Python][Dataset] Add partition_filename_cb to ds.write_dataset() (Closed)
- is related to:
  - ARROW-12811 [C++][Dataset] Dataset repartition / filter / update (Open)
  - ARROW-10695 [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset (Closed)
- relates to:
  - ARROW-7706 [Python] saving a dataframe to the same partitioned location silently doubles the data (Open)