Apache Arrow / ARROW-15265

[C++][Python][Dataset] write_dataset with delete_matching hangs when the number of partitions is too large

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0.1
    • Fix Version/s: 7.0.0
    • Component/s: C++

    Description

      I'm attempting to use the existing_data_behavior="delete_matching" option with ds.write_dataset to write a Hive-partitioned Parquet dataset to S3. This works fine when the table being written produces 7 or fewer partitions, but as soon as the partition column has an 8th unique value the write hangs completely.

       

      import numpy as np
      import pyarrow as pa
      from pyarrow import fs
      import pyarrow.dataset as ds
      
      bucket = "my-bucket"
      s3 = fs.S3FileSystem()
      
      cols_7 = ["a", "b", "c", "d", "e", "f", "g"]
      table_7 = pa.table(
          {"col1": cols_7 * 5, "col2": np.random.randn(len(cols_7) * 5)}
      )
      # succeeds
      ds.write_dataset(
          data=table_7,
          base_dir=f"{bucket}/test7.parquet",
          format="parquet",
          partitioning=["col1"],
          partitioning_flavor="hive",
          filesystem=s3,
          existing_data_behavior="delete_matching",
      )
      
      cols_8 = ["a", "b", "c", "d", "e", "f", "g", "h"]
      table_8 = pa.table(
          {"col1": cols_8 * 5, "col2": np.random.randn(len(cols_8) * 5)}
      )
      # this hangs
      ds.write_dataset(
          data=table_8,
          base_dir=f"{bucket}/test8.parquet",
          format="parquet",
          partitioning=["col1"],
          partitioning_flavor="hive",
          filesystem=s3,
          existing_data_behavior="delete_matching",
      ) 

      For the dataset with 8 partitions, the partition directories are created in S3, but no data files are written before the write hangs.
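
      Because the failure mode is a hang rather than an exception, it can help to run each write in a child process with a time limit, so the 7-partition and 8-partition cases can be compared without blocking the session. The following is only a sketch under stated assumptions: run_with_timeout and make_repro are hypothetical helpers (not part of pyarrow), the bucket path and the time limit are placeholders, and the generated script simply repeats the write_dataset call shown above with a configurable number of distinct col1 values (at most 26, since single lowercase letters are used as keys).

```python
import subprocess
import sys
import textwrap


def run_with_timeout(snippet: str, timeout_s: float) -> str:
    """Hypothetical helper: run a Python snippet in a child process.

    Returns "finished", "failed" (non-zero exit code), or "timed out".
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising,
        # so a hung write does not linger in the background.
        return "timed out"
    return "finished" if proc.returncode == 0 else "failed"


def make_repro(n_partitions: int, base_dir: str) -> str:
    """Hypothetical helper: build the write from the report as a standalone
    script that produces n_partitions distinct values in col1."""
    return textwrap.dedent(f"""
        import string
        import numpy as np
        import pyarrow as pa
        import pyarrow.dataset as ds
        from pyarrow import fs

        keys = list(string.ascii_lowercase[:{n_partitions}])
        table = pa.table(
            {{"col1": keys * 5, "col2": np.random.randn(len(keys) * 5)}}
        )
        ds.write_dataset(
            data=table,
            base_dir={base_dir!r},
            format="parquet",
            partitioning=["col1"],
            partitioning_flavor="hive",
            filesystem=fs.S3FileSystem(),
            existing_data_behavior="delete_matching",
        )
    """)
```

      With S3 credentials configured, calling run_with_timeout(make_repro(n, f"my-bucket/test{n}.parquet"), timeout_s=120) for n = 7 and n = 8 should, if the behavior matches the description above, report "finished" for the first and "timed out" for the second on an affected version. The 120 second limit is an arbitrary choice and may need adjusting for slow connections.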

       

            People

              Assignee: David Li (lidavidm)
              Reporter: Caleb Overman (coverman)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 5.5h