[ARROW-15265] [C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 6.0.1
Fix Version/s: 7.0.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/30758

Description

I'm attempting to use the existing_data_behavior="delete_matching" option with ds.write_dataset to write a hive-partitioned Parquet dataset to S3. This works fine when the table being written produces 7 or fewer partitions, but as soon as the partition column has an 8th unique value, the write hangs completely.

import numpy as np
import pyarrow as pa
from pyarrow import fs
import pyarrow.dataset as ds

bucket = "my-bucket"
s3 = fs.S3FileSystem()

cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
table_7 = pa.table(
    {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
)
# succeeds
ds.write_dataset(
    data=table_7,
    base_dir=f"{bucket}/test7.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)

cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
table_8 = pa.table(
    {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
)
# this hangs
ds.write_dataset(
    data=table_8,
    base_dir=f"{bucket}/test8.parquet",
    format="parquet",
    partitioning=["col1"],
    partitioning_flavor="hive",
    filesystem=s3,
    existing_data_behavior="delete_matching",
)

For the dataset with 8 partitions, the directory structure is created in S3, but no data files are written before the write hangs.
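
Because the failing call blocks instead of raising, a reproduction script never returns on its own. The following is a minimal sketch of a stdlib-only helper that reports whether a call finished within a time limit. The name `finishes_within` and its parameters are hypothetical and not part of pyarrow, and the approach assumes the blocking call releases the GIL so the main thread can keep waiting on the timeout.

```python
import threading


def finishes_within(fn, timeout_s):
    """Run fn() in a daemon thread and report whether it returned in time.

    Returns True if fn() completed within timeout_s seconds and False if it
    was still running when the timeout expired. An exception raised by fn()
    is re-raised in the caller. (Hypothetical helper, not a pyarrow API.)
    """
    outcome = {}

    def target():
        try:
            fn()
        except Exception as exc:  # keep the error so the caller can see it
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if fn() never returns.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return False  # still blocked: treat as a hang
    if "error" in outcome:
        raise outcome["error"]
    return True
```

Wrapping the ds.write_dataset call for table_8 in a zero-argument function and passing it to this helper with a generous timeout would turn the hang into a False result. The stuck thread is not cancelled; it is only abandoned when the process exits.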

Issue Links

is related to: ARROW-15285 [C++] write_dataset with delete_matching occasionally fails with "Path does not exist" (Open)

links to: GitHub Pull Request #12099

People

Assignee: David Li

Reporter: Caleb Overman

Votes: 0

Watchers: 4

Dates

Created: 05/Jan/22 23:07

Updated: 11/Jan/23 11:35

Resolved: 12/Jan/22 20:32

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 5.5h