Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 3.0.0
- Fix Version: None
- Environment:
  OS: Debian GNU/Linux 10 (buster) x86_64
  Kernel: 4.19.0-14-amd64
  CPU: Intel i7-6700K (8) @ 4.200GHz
  Memory: 32122MiB
  Python: v3.7.3
Description
When using the new Dataset API under v3.0.0, it crashes immediately with

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

This did not happen with earlier versions. The error message leads me to believe that the issue is not on the Python side but in the C++ libraries.
As background, I am using the new Dataset API as follows:

s3_fs = fs.S3FileSystem(<minio credentials>)
dataset = pq.ParquetDataset(
    f"{bucket}/{base_path}",
    filesystem=s3_fs,
    partitioning="hive",
    use_legacy_dataset=False,
    filters=filters,
)
dataframe = dataset.read_pandas(columns=columns).to_pandas()
The dataset itself contains tens of thousands of files of around 100 MB each and is created by incremental bulk processing with pandas and pyarrow v1.0.1. With the filters I limit the number of files fetched to around 20.
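For reference, a minimal sketch of the kind of filter expression used to restrict the files read; the partition column names ("year", "month") are hypothetical stand-ins, since the real schema is not shown above. With `use_legacy_dataset=False`, filters are given in DNF form as a list of (column, op, value) tuples and are applied to hive partition keys before any data files are opened:

```python
# Hypothetical hive partition columns for illustration only.
# Each tuple is (column, operator, value); tuples in the same list are AND-ed.
filters = [
    ("year", "=", 2021),
    ("month", "in", [1, 2]),
]

# Every element must be a 3-tuple in DNF form.
assert all(len(f) == 3 for f in filters)
```
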
I suspect the total number of threads being spawned is hitting a limit, but I have been unable to work around it by calling

pyarrow.set_cpu_count(1)