Apache Arrow / ARROW-11857

[Python] Resource temporarily unavailable when using the new Dataset API with Pandas

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 4.0.0
    • Component/s: Python
    • Labels: None
    • Environment:
      OS: Debian GNU/Linux 10 (buster) x86_64
      Kernel: 4.19.0-14-amd64
      CPU: Intel i7-6700K (8) @ 4.200GHz
      Memory: 32122MiB
      Python: v3.7.3

    Description

      When using the new Dataset API under v3.0.0, the process crashes immediately with:

       terminate called after throwing an instance of 'std::system_error'
       what(): Resource temporarily unavailable

      This does not happen in an earlier version. The error message leads me to believe that the issue is not on the Python side but might be in the C++ libraries.

      As background, I am using the new Dataset API by calling the following:

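      import pyarrow.parquet as pq
      from pyarrow import fs

      # bucket, base_path, filters and columns are defined elsewhere
      s3_fs = fs.S3FileSystem(<minio credentials>)
      dataset = pq.ParquetDataset(
          f"{bucket}/{base_path}",
          filesystem=s3_fs,
          partitioning="hive",
          use_legacy_dataset=False,  # select the new Dataset API
          filters=filters,
      )
      dataframe = dataset.read_pandas(columns=columns).to_pandas()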

      The dataset itself contains tens of thousands of files, each around 100 MB in size, and is created using incremental bulk processing from pandas and pyarrow v1.0.1. With the filters I am limiting the number of files that are fetched to around 20.

      I suspect the process is hitting a limit on the total number of threads being spawned, but I have been unable to resolve it by calling

      pyarrow.set_cpu_count(1) 
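
To help check the thread-limit suspicion, here is a minimal diagnostic sketch using only the Python standard library. It is not part of the original report. It assumes a Linux host, as in the environment above. On Linux with glibc, "Resource temporarily unavailable" is the message for `EAGAIN`. That is consistent with thread creation failing because a limit was reached, although the traceback alone does not prove it. The helper names `read_proc_value` and `thread_limit_report` are hypothetical and exist only for this illustration.

```python
import errno
import os
import resource
from pathlib import Path


def read_proc_value(path, key=None):
    """Return an integer from a /proc file, or None if it is unavailable.

    With key=None the whole file is parsed as one integer; otherwise the
    first line starting with "<key>:" is parsed.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    if key is None:
        return int(text.strip())
    for line in text.splitlines():
        if line.startswith(key + ":"):
            return int(line.split(":", 1)[1].strip())
    return None


def thread_limit_report():
    """Collect the limits that commonly cap thread creation on Linux."""
    # RLIMIT_NPROC caps processes *and* threads for the current user.
    # A value equal to resource.RLIM_INFINITY means "unlimited".
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "eagain_message": os.strerror(errno.EAGAIN),
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        "kernel_threads_max": read_proc_value("/proc/sys/kernel/threads-max"),
        "threads_in_this_process": read_proc_value("/proc/self/status", "Threads"),
    }


if __name__ == "__main__":
    for name, value in thread_limit_report().items():
        print(f"{name}: {value}")
```

Calling `thread_limit_report()` just before the failing `read_pandas` call, or from a background thread while it runs, shows how close the process is to these limits. Inside a container, a cgroup limit on the number of tasks may apply as well, and this sketch does not inspect it.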

      Attachments

        1. gdb.txt.gz
          175 kB
          Anton Friberg

          People

            Assignee: Weston Pace (westonpace)
            Reporter: Anton Friberg (AntonFriberg)
            Votes: 0
            Watchers: 5
