[ARROW-11857] [Python] Resource temporarily unavailable when using the new Dataset API with Pandas - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 4.0.0
Component/s: Python
Labels: None
Environment:
OS: Debian GNU/Linux 10 (buster) x86_64
Kernel: 4.19.0-14-amd64
CPU: Intel i7-6700K (8) @ 4.200GHz
Memory: 32122MiB
Python: v3.7.3

External issue URL:
https://github.com/apache/arrow/issues/27703

Description

With v3.0.0, using the new Dataset API crashes immediately with:

 terminate called after throwing an instance of 'std::system_error'
 what(): Resource temporarily unavailable

This does not happen in earlier versions. The error message suggests that the problem is in the C++ libraries rather than on the Python side.
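
One possible reading of the message, offered as a sketch rather than a confirmed diagnosis: "Resource temporarily unavailable" is the text the operating system associates with the error code EAGAIN, which is also what thread creation reports when a process or system thread limit is reached. The snippet below uses only the Python standard library to look up that text; the variable name `eagain_message` is illustrative.

```python
import errno
import os

# Look up the OS message for EAGAIN. On Linux this is the same text
# that appears in the what() line of the crash output.
eagain_message = os.strerror(errno.EAGAIN)
print(f"EAGAIN = {errno.EAGAIN}: {eagain_message}")
```

If the printed text matches the crash output, a failed native thread creation is a plausible (though not the only) cause.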

As background, I am using the new Dataset API by calling the following:

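import pyarrow.parquet as pq
from pyarrow import fs

# Connect to the MinIO (S3-compatible) store.
s3_fs = fs.S3FileSystem(<minio credentials>)
# use_legacy_dataset=False selects the new Dataset API implementation.
dataset = pq.ParquetDataset(
    f"{bucket}/{base_path}",
    filesystem=s3_fs,
    partitioning="hive",
    use_legacy_dataset=False,
    filters=filters,
)
dataframe = dataset.read_pandas(columns=columns).to_pandas()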

The dataset contains tens of thousands of files, each around 100 MB in size, and was created by incremental bulk processing with pandas and pyarrow v1.0.1. With the filters I am limiting the number of files that are fetched to around 20.

I suspect a limit on the total number of threads being spawned, but I have been unable to resolve the problem by calling:

pyarrow.set_cpu_count(1)
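
To check the thread-limit suspicion independently of pyarrow, a minimal standard-library sketch can report the per-user process/thread limit and the number of threads the current process already has. The helper name `thread_budget` and the dictionary keys are hypothetical, and the `/proc/self/status` lookup assumes a Linux system such as the Debian environment listed above.

```python
import resource
import threading


def thread_budget():
    """Collect limits and counts relevant to thread creation."""
    # RLIMIT_NPROC caps processes/threads per user; -1 means unlimited.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    info = {
        "rlimit_nproc_soft": soft,
        "rlimit_nproc_hard": hard,
        # Only threads started through Python's threading module.
        "python_threads": threading.active_count(),
        "os_threads": None,
    }
    # /proc/self/status counts every OS thread in the process,
    # including threads started by native libraries.
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    info["os_threads"] = int(line.split()[1])
                    break
    except OSError:
        pass  # /proc is not available on this platform
    return info


print(thread_budget())
```

Calling `thread_budget()` just before the read shows how close the process is to its limit; the OS-level count is the relevant one here, since threads created inside native libraries do not appear in `threading.active_count()`.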

Attachments


gdb.txt.gz
07/Apr/21 09:30
175 kB
Anton Friberg


People

Assignee: Weston Pace

Reporter: Anton Friberg

Votes: 0

Watchers: 5

Dates

Created: 04/Mar/21 11:01

Updated: 11/Jan/23 08:22

Resolved: 07/Apr/21 11:06