Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
8.0.0
None
Description
Multi-threaded read performance in Arrow's GCS file system implementation is currently relatively low. Because cloud blob stores such as GCS have high latency, a common strategy is to use many concurrent readers, e.g. 100 threads, provided the system has enough memory to support them.
The GCS client library offers a ConnectionPoolSize option. If this option is set too low, concurrency is throttled. It is currently not exposed in GcsOptions, which limits multi-threaded throughput.
Instead of exposing this option, an alternative implementation strategy could be to use the same value as set by arrow::io::SetIOThreadPoolCapacity.
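A minimal sketch of how the two strategies above could be combined: use an explicit pool size if one is set, otherwise fall back to the IO thread pool capacity. GcsOptionsSketch, its connection_pool_size field, and ResolveConnectionPoolSize are hypothetical names for illustration only. They are not part of Arrow's GcsOptions or of the GCS client library.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical stand-in for arrow::fs::GcsOptions. The real struct does not
// have this field; it is an assumption made for this sketch.
struct GcsOptionsSketch {
  // Unset means "derive the pool size from the IO thread pool capacity".
  std::optional<std::size_t> connection_pool_size;
};

// Chooses the connection pool size to pass to the GCS client.
// io_thread_capacity stands in for the value configured through
// arrow::io::SetIOThreadPoolCapacity.
inline std::size_t ResolveConnectionPoolSize(const GcsOptionsSketch& options,
                                             int io_thread_capacity) {
  if (options.connection_pool_size.has_value()) {
    // An explicit setting always wins.
    return *options.connection_pool_size;
  }
  // Fall back to the IO thread pool capacity, keeping at least one connection.
  return io_thread_capacity > 0 ? static_cast<std::size_t>(io_thread_capacity)
                                : 1;
}
```

With this shape, existing callers that never set the field would get a pool sized to the IO thread pool, while callers that need a different value could override it. The resolved value would then presumably be handed to the GCS client through its connection pool size option when the client is constructed.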