Apache Arrow / ARROW-16204

[C++][Dataset] Default "error" existing_data_behavior for writing dataset ignores a single file

Details

    Description

      While trying to understand a failing test in https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed that the write_dataset function does not always raise an error by default when data already exists in the target location.

      The documentation says it will raise "if any data exists in the destination" (which is also what I would expect), but in practice it seems to ignore certain file names:

      import pyarrow as pa
      import pyarrow.dataset as ds
      table = pa.table({'a': [1, 2, 3]})
      
      # write a first time to new directory: OK
      >>> ds.write_dataset(table, "test_overwrite", format="parquet")
      >>> !ls test_overwrite
      part-0.parquet
      
      # write a second time to the same directory: passes, but should raise?
      >>> ds.write_dataset(table, "test_overwrite", format="parquet")
      >>> !ls test_overwrite
      part-0.parquet
      
      # write another time to the same directory with a different name: still passes
      >>> ds.write_dataset(table, "test_overwrite", format="parquet", basename_template="data-{i}.parquet")
      >>> !ls test_overwrite
      data-0.parquet	part-0.parquet
      
      # now writing again finally raises an error
      >>> ds.write_dataset(table, "test_overwrite", format="parquet")
      ...
      ArrowInvalid: Could not write to test_overwrite as the directory is not empty and existing_data_behavior is to error
      

      So when checking for existing data, it seems to ignore any files that match the basename template pattern.
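
      For comparison, the documented behaviour ("raise if any data exists in the destination") can be sketched as a pre-flight check that makes no exception for particular file names. The helper below is hypothetical and not part of pyarrow; its name and the choice of FileExistsError are assumptions made for illustration only. It uses only the standard library.

```python
from pathlib import Path


def assert_destination_empty(base_dir):
    """Raise FileExistsError if base_dir already contains any file.

    Hypothetical helper, NOT a pyarrow API: it illustrates the documented
    meaning of the "error" behaviour, with no file names exempted.
    """
    root = Path(base_dir)
    if not root.exists():
        return  # nothing there yet, so nothing can be overwritten
    if root.is_file():
        raise FileExistsError(f"{root} is an existing file")
    # Look at every file at any depth, regardless of its name.
    existing = sorted(p for p in root.rglob("*") if p.is_file())
    if existing:
        raise FileExistsError(
            f"{root} is not empty: found {len(existing)} file(s), "
            f"e.g. {existing[0].name}"
        )
```

      Calling assert_destination_empty("test_overwrite") before ds.write_dataset(...) would then fail on the second write in the example above, since part-0.parquet is already present.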

      cc westonpace do you know if this was intentional? (I would find that a strange corner case, and in any case it is also not documented)

            People

              jorisvandenbossche Joris Van den Bossche