Apache Arrow
ARROW-3538

[Python] ability to override the automated assignment of uuid for filenames when writing datasets


    Details

      Description

      Say I have a pandas DataFrame df that I would like to store on disk as a dataset using pyarrow's parquet module. I would do this:

      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, root_path=some_path, partition_cols=['a'])

      On disk the dataset would look something like this:

      some_path
      ├── a=1
      │   └── 4498704937d84fe5abebb3f06515ab2d.parquet
      ├── a=2
      │   └── 8bcfaed8986c4bdba587aaaee532370c.parquet

      Wished feature: it would be great if I could somehow override the automatic assignment of the long UUID as the filename when writing the dataset. My purpose is to be able to overwrite the dataset on disk when I have a new version of df. Currently, if I write the dataset again, another new, uniquely named [UUID].parquet file is placed next to the old one, containing the same, redundant data.
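      A minimal sketch of the wished-for behavior (the callback shape here is illustrative, not an existing pyarrow parameter): deriving the file name deterministically from the partition values, instead of from a random UUID, would make a rewrite land on the same path and overwrite the stale file.

      ```python
      import uuid

      def default_filename(partition_keys):
          # Current behavior: a random UUID, different on every write,
          # so rewriting the dataset accumulates redundant files.
          return uuid.uuid4().hex + ".parquet"

      def deterministic_filename(partition_keys):
          # Wished behavior: a stable name derived from the partition
          # key/value pairs, so writing the same partition again reuses
          # (and can overwrite) the same file name.
          return "-".join(f"{k}={v}" for k, v in partition_keys) + ".parquet"

      # e.g. deterministic_filename([("a", 1)]) always yields "a=1.parquet",
      # while default_filename([("a", 1)]) yields a fresh UUID each call.
      ```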


              People

              • Assignee: Thomas Elvey
              • Reporter: Ji Xu
              • Votes: 0
              • Watchers: 4


                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 3h 20m