Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently, dataset writing (e.g. with pyarrow.dataset.write_dataset) uses a fixed filename template ("part{i}.ext"). This means that when writing to an existing dataset with this default template, the new files silently replace the previously written ones.
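A minimal sketch of the collision (the directory name "my_dataset" is illustrative, and the exact default template string may vary by pyarrow version):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# First write: the default template produces a file such as part-0.parquet.
ds.write_dataset(pa.table({"a": [1, 2, 3]}), "my_dataset", format="parquet")

# Second write to the same directory: the default template generates the
# same file name again, so the file from the first write is silently
# replaced rather than appended to.
ds.write_dataset(pa.table({"a": [4, 5, 6]}), "my_dataset", format="parquet")
{code}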
There is some discussion in ARROW-10695 about how the user can avoid this by ensuring the file names are unique (specifying a unique basename_template). Conversely, ARROW-7706 covers the legacy parquet.write_to_dataset implementation, which does not overwrite existing data but instead silently doubles it.
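For illustration, a sketch of that workaround using a UUID in the basename_template, as suggested in ARROW-10695 (the exact template shape is an assumption; any unique token works, as long as the template still contains "{i}"):
{code:python}
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

# Embedding a per-write UUID in the template guarantees that a second
# write cannot collide with files produced by an earlier write.
ds.write_dataset(
    pa.table({"a": [4, 5, 6]}),
    "my_dataset",
    format="parquet",
    basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
)
{code}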
It would be good to have a "mode" option when writing datasets that controls these different behaviours. Raising an error when there is pre-existing data in the target directory is probably the safest default, because silently appending and silently overwriting can both be surprising behaviour, depending on the user's expectations.
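To make the proposal concrete, here is a hypothetical sketch of such a mode, written as a wrapper around the current API (write_dataset_with_mode and the mode names "error"/"append"/"overwrite" are invented for illustration, not an existing pyarrow API; a real implementation would live in the dataset writer itself):
{code:python}
import os
import shutil
import uuid

import pyarrow.dataset as ds


def write_dataset_with_mode(data, base_dir, mode="error", **kwargs):
    """Hypothetical wrapper sketching the proposed 'mode' behaviours.

    mode="error"     -> refuse to write if base_dir already contains data
                        (the suggested safest default)
    mode="append"    -> keep existing files, write new ones under unique names
    mode="overwrite" -> remove pre-existing contents before writing
    """
    has_data = os.path.isdir(base_dir) and bool(os.listdir(base_dir))
    if mode == "error" and has_data:
        raise FileExistsError(f"{base_dir!r} already contains data")
    if mode == "overwrite" and has_data:
        shutil.rmtree(base_dir)
    if mode == "append":
        # Reuse the UUID workaround so appended files never collide
        # (".parquet" extension assumed here for simplicity).
        kwargs.setdefault(
            "basename_template", f"part-{uuid.uuid4().hex}-{{i}}.parquet"
        )
    ds.write_dataset(data, base_dir, **kwargs)
{code}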
Issue Links
- depends upon:
  - ARROW-12509 [C++] More fine-grained control of file creation in filesystem layer (In Progress)
- fixes:
  - ARROW-12365 [Python][Dataset] Add partition_filename_cb to ds.write_dataset() (Closed)
- is related to:
  - ARROW-12811 [C++][Dataset] Dataset repartition / filter / update (Open)
  - ARROW-10695 [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset (Closed)
- relates to:
  - ARROW-7706 [Python] saving a dataframe to the same partitioned location silently doubles the data (Open)