ARROW-12358: [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++

    Description

      Currently, dataset writing (e.g. with pyarrow.dataset.write_dataset) uses a fixed default filename template ("part-{i}.ext"). As a result, writing to an existing dataset with the default template de facto overwrites previously written files that have the same names.

      ARROW-10695 discusses how the user can avoid this by making the file names unique (by setting basename_template to something unique). ARROW-7706 covers the opposite problem: the legacy parquet.write_to_dataset implementation does not overwrite existing data, so writing the same data twice silently doubles it.
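
      As a minimal sketch of the workaround discussed in ARROW-10695, a unique template can be generated for each write call. The helper name unique_basename_template and the use of a random UUID are assumptions made for illustration; only the basename_template parameter and the "{i}" placeholder come from the description above.

```python
import uuid


def unique_basename_template(extension: str = "parquet") -> str:
    """Return a basename template that differs on every call.

    Hypothetical helper, not part of pyarrow. The "{i}" placeholder is
    kept literally so the dataset writer can still substitute the
    per-file counter.
    """
    token = uuid.uuid4().hex  # random 32-character hex string
    return f"part-{token}-{{i}}.{extension}"


template = unique_basename_template()

# Assumed usage with pyarrow (not executed here):
#   import pyarrow.dataset as ds
#   ds.write_dataset(table, "path/to/dataset", format="parquet",
#                    basename_template=template)
```

      Because each call produces a different template, files from a second write do not collide with files from the first, so the result is an append rather than an overwrite.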

      It would be useful to have a "mode" option when writing datasets that controls which of these behaviours applies. Raising an error when the target directory already contains data is probably the safest default, because silently appending or silently overwriting can each be surprising, depending on what the user expects.
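
      The proposed behaviours could look roughly like the following sketch. The function name prepare_target and the mode names "error", "overwrite" and "append" are hypothetical and are not an existing Arrow API; the sketch only illustrates the directory check, with "error" as the default.

```python
import shutil
from pathlib import Path


def prepare_target(base_dir, mode="error"):
    """Prepare a dataset directory according to a hypothetical write mode.

    mode="error":     raise if the directory already contains anything
    mode="overwrite": delete existing contents before writing
    mode="append":    keep existing contents (file names must be unique)
    """
    if mode not in {"error", "overwrite", "append"}:
        raise ValueError(f"unknown mode: {mode!r}")

    target = Path(base_dir)
    has_data = target.is_dir() and any(target.iterdir())

    if has_data and mode == "error":
        raise FileExistsError(f"{target} already contains data")
    if has_data and mode == "overwrite":
        shutil.rmtree(target)  # drop all previously written files

    target.mkdir(parents=True, exist_ok=True)
    return target
```

      In "append" mode the caller still has to ensure that new file names do not clash with existing ones, for example by using a unique basename_template.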

            People

              Assignee: Unassigned
              Reporter: Joris Van den Bossche (jorisvandenbossche)
              Votes: 2
              Watchers: 9
