[ARROW-3538] [Python] ability to override the automated assignment of uuid for filenames when writing datasets - ASF JIRA

Details

Type: Wish
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.10.0
Fix Version/s: 0.15.0
Component/s: Python
Labels:

External issue URL:
https://github.com/apache/arrow/issues/19853

Description

Say I have a pandas DataFrame df that I would like to store on disk as a partitioned dataset using pyarrow's Parquet support. I would do this:

import pyarrow
import pyarrow.parquet
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])

On disk, the dataset would look something like this:
some_path
├── a=1
│   └── 4498704937d84fe5abebb3f06515ab2d.parquet
└── a=2
    └── 8bcfaed8986c4bdba587aaaee532370c.parquet

Wished feature: It would be great to be able to override the automatically assigned UUID filename when writing a dataset. My goal is to overwrite the dataset on disk when I have a new version of df. Currently, if I write the dataset again, a new, uniquely named [UUID].parquet file is placed next to the old one in each partition directory, so the dataset ends up containing the same data twice.
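
To make the request concrete, here is a minimal sketch of the behaviour being asked for, using only the Python standard library. It is not pyarrow code: write_partitioned, filename_cb and count_files are hypothetical names invented for this illustration, and CSV files stand in for Parquet files. The point is only that a caller-supplied filename makes a second write replace the first, while UUID names accumulate.

```python
import csv
import os
import tempfile
import uuid
from collections import defaultdict


def write_partitioned(rows, root_path, partition_col, filename_cb=None):
    """Hypothetical helper: write rows into root_path/<col>=<value>/<file>.

    If filename_cb is given, it receives the partition value and returns the
    filename to use; otherwise a random UUID-based name is generated.
    """
    # Group the rows by the value of the partition column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)

    written = []
    for key, group in groups.items():
        part_dir = os.path.join(root_path, f"{partition_col}={key}")
        os.makedirs(part_dir, exist_ok=True)
        name = filename_cb(key) if filename_cb else uuid.uuid4().hex + ".csv"
        path = os.path.join(part_dir, name)
        # Mode "w" truncates, so an existing file with the same name is replaced.
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return written


def count_files(root_path):
    """Count all files below root_path."""
    return sum(len(files) for _, _, files in os.walk(root_path))


rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]

with tempfile.TemporaryDirectory() as tmp:
    # Default behaviour: every write adds new UUID-named files.
    uuid_root = os.path.join(tmp, "uuid_names")
    write_partitioned(rows, uuid_root, "a")
    write_partitioned(rows, uuid_root, "a")
    uuid_file_count = count_files(uuid_root)

    # Wished behaviour: a deterministic name per partition, so a rewrite overwrites.
    fixed_root = os.path.join(tmp, "fixed_names")
    name_for = lambda key: f"part-{key}.csv"
    write_partitioned(rows, fixed_root, "a", filename_cb=name_for)
    write_partitioned(rows, fixed_root, "a", filename_cb=name_for)
    fixed_file_count = count_files(fixed_root)
```

After both runs, the UUID-named tree holds two files per partition, while the tree written with the callback still holds one file per partition. Whether and how pyarrow exposes such a hook should be checked against the documentation of the installed version.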

Issue Links

links to GitHub Pull Request #4630

People

Assignee: Thomas Elvey
Reporter: Ji Xu
Votes: 0
Watchers: 5

Dates

Created: 17/Oct/18 06:24
Updated: 11/Jan/23 07:28
Resolved: 20/Aug/19 02:39

Time Tracking

Estimated: Not Specified
Remaining:
Logged:
