Details
Type: Wish
Status: Resolved
Priority: Major
Resolution: Fixed
Fix Version: 0.10.0
Description
Say I have a pandas DataFrame df that I would like to store on disk as a dataset using pyarrow.parquet. I would do this:
import pyarrow.parquet

table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])
On disk the dataset would look something like this:
some_path
├── a=1
│   └── 4498704937d84fe5abebb3f06515ab2d.parquet
└── a=2
    └── 8bcfaed8986c4bdba587aaaee532370c.parquet
Wished feature: It would be great if I could somehow override the auto-assignment of the long UUID as the filename during dataset writing. My goal is to be able to overwrite the dataset on disk when I have a new version of df. Currently, if I write the dataset again, another new uniquely named [UUID].parquet file is placed next to the old one, containing the same, redundant data.
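Since this is marked as fixed in 0.10.0, here is a minimal sketch of how such an override might look, assuming the partition_filename_cb keyword of pyarrow.parquet.write_to_dataset (a callback that receives the partition key values and returns the filename to use). The DataFrame contents and the 'part-...' naming scheme below are illustrative only; rerunning the write with the same callback would then replace the existing file instead of adding a new UUID-named one.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative data; any DataFrame with a partition column 'a' would do.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [0.1, 0.2, 0.3]})
table = pa.Table.from_pandas(df)

# Name each partition's file after its partition key values instead of a UUID,
# e.g. some_path/a=1/part-1.parquet, so a rewrite overwrites the same file.
pq.write_to_dataset(
    table,
    root_path='some_path',
    partition_cols=['a'],
    partition_filename_cb=lambda keys: 'part-' + '-'.join(str(k) for k in keys) + '.parquet',
)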