[ARROW-16204] [C++][Dataset] Default error existing_data_behaviour for writing dataset ignores a single file - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 8.0.0
Component/s: C++
Labels:
- dataset
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/20193

Description

While trying to understand a failing test in https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed that the write_dataset function does not always raise an error by default when data already exists in the target location.

The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it seems to ignore certain file names:

import pyarrow as pa
import pyarrow.dataset as ds
table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write another time to the same directory with a different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet	part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error

So when checking whether existing data is present, it seems to ignore any files that match the basename template pattern.
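Until the check behaves as documented, a caller-side guard can enforce the "raise if any data exists" expectation before calling write_dataset. The following is only a minimal sketch under stated assumptions: the helper name assert_destination_empty is hypothetical (it is not part of pyarrow), and it assumes the destination is on the local filesystem.

```python
import os
import tempfile


def assert_destination_empty(base_dir):
    """Raise if base_dir already contains any entry.

    Hypothetical caller-side guard, not part of pyarrow. It assumes a
    local filesystem path and treats a missing directory as empty.
    """
    if not os.path.exists(base_dir):
        # Nothing exists at the destination yet, so there is no data to protect.
        return
    if not os.path.isdir(base_dir):
        raise NotADirectoryError(f"{base_dir} exists and is not a directory")
    with os.scandir(base_dir) as entries:
        # Any entry counts as existing data, whatever its file name.
        first = next(entries, None)
    if first is not None:
        raise FileExistsError(
            f"{base_dir} is not empty (found {first.name!r}); refusing to write"
        )
```

Calling assert_destination_empty("test_overwrite") right before ds.write_dataset(table, "test_overwrite", format="parquet") would make the second write in the session above fail, regardless of which file names the built-in check skips.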

cc westonpace: do you know if this was intentional? (I would find that a strange corner case, and in any case it is not documented.)

Issue Links

links to

GitHub Pull Request #12898

People

Assignee: Joris Van den Bossche

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 15/Apr/22 08:48

Updated: 11/Jan/23 11:42

Resolved: 22/Apr/22 16:11
