  1. Apache Arrow
  2. ARROW-3538

[Python] ability to override the automated assignment of uuid for filenames when writing datasets

Details

    Description

      Say I have a pandas DataFrame df that I would like to store on disk as a dataset using pyarrow's Parquet support. I would do this:

      import pyarrow.parquet  # also binds the top-level pyarrow package
      table = pyarrow.Table.from_pandas(df)
      pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])

      On disk, the dataset would look something like this:
      some_path
      ├── a=1
      │   └── 4498704937d84fe5abebb3f06515ab2d.parquet
      └── a=2
          └── 8bcfaed8986c4bdba587aaaee532370c.parquet

      Requested feature: It would be great if I could override the automatically assigned UUID filename when writing the dataset. My goal is to be able to overwrite the dataset on disk when I have a new version of df. Currently, if I write the dataset again, another uniquely named [UUID].parquet file is placed next to the old one in each partition directory, containing the same, redundant data.
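
      Until filenames can be controlled, one possible workaround is to clear the dataset directory before each rewrite, so that stale [UUID].parquet files never accumulate. The sketch below is only an illustration under stated assumptions: overwrite_dataset, write_fn, and fake_writer are hypothetical names introduced here and are not part of pyarrow. The writer is passed in as a callback so the example runs with the standard library alone; in practice the callback would wrap the write_to_dataset call shown above.

```python
import shutil
import uuid
from pathlib import Path
from typing import Callable


def overwrite_dataset(root_path: str, write_fn: Callable[[str], None]) -> None:
    """Delete any existing dataset at root_path, then write a fresh copy.

    write_fn is a hypothetical callback that receives root_path and writes
    the dataset there, for example:
        lambda p: pyarrow.parquet.write_to_dataset(
            table, root_path=p, partition_cols=['a'])
    """
    root = Path(root_path)
    if root.exists():
        # Remove the previous version, including its old [UUID].parquet files.
        shutil.rmtree(root)
    write_fn(root_path)


def fake_writer(root_path: str) -> None:
    """Stand-in for write_to_dataset: one UUID-named file per partition."""
    for value in (1, 2):
        part_dir = Path(root_path) / f"a={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        (part_dir / f"{uuid.uuid4().hex}.parquet").touch()
```

      Note that this approach is not atomic: readers may see a missing or partial dataset between the delete and the write, and everything under root_path is removed, so it should only be pointed at a directory that holds nothing but the dataset.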

            People

              Tomme Thomas Elvey
              XiUpsilon Ji Xu


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 20m