Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 6.0.1
Description
I'm attempting to use the existing_data_behavior="delete_matching" option with ds.write_dataset to write a Hive-partitioned Parquet dataset to S3. This works perfectly fine when the table being written creates 7 or fewer partitions, but as soon as the partition column contains an 8th unique value the write hangs completely.
import numpy as np
import pyarrow as pa
from pyarrow import fs
import pyarrow.dataset as ds

bucket = "my-bucket"
s3 = fs.S3FileSystem()

cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
table_7 = pa.table(
    {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
)

# succeeds
ds.write_dataset(
    data=table_7,
    base_dir=f"{bucket}/test7.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)

cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
table_8 = pa.table(
    {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
)

# this hangs
ds.write_dataset(
    data=table_8,
    base_dir=f"{bucket}/test8.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)
For the table with 8 partitions, the directory structure is created in S3, but no actual files are written before the hang.
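In the meantime, a possible workaround (not confirmed as the intended fix, and using only the public pyarrow.fs and pyarrow.dataset APIs) is to avoid delete_matching altogether: delete the partition directories that are about to be rewritten yourself, then write with existing_data_behavior="overwrite_or_ignore". A minimal sketch, reusing the bucket, s3, and table_8 objects from the reproduction above:

from pyarrow.fs import FileType

base_dir = f"{bucket}/test8.parquet"

# Drop any existing data for the partition values we are about to write,
# so "overwrite_or_ignore" cannot leave stale files behind.
for value in table_8["col1"].unique().to_pylist():
    partition_dir = f"{base_dir}/col1={value}"
    if s3.get_file_info(partition_dir).type != FileType.NotFound:
        s3.delete_dir(partition_dir)

# Write without delete_matching, which is the behavior that hangs here.
ds.write_dataset(
    data=table_8,
    base_dir=base_dir,
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="overwrite_or_ignore",
)

This sketch assumes the new data fully replaces each partition it touches; it does not reproduce the atomicity of delete_matching, since the deletes happen before the write starts.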
Issue Links
- is related to: ARROW-15285 [C++] write_dataset with delete_matching occasionally fails with "Path does not exist" (Open)