Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I'm not sure whether this is a misunderstanding on my part, a compilation issue (flags?), or an issue in the C++ layer.
I have 1000 parquet files with a total of 1 billion rows (1 million rows per file, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 columns efficiently (the vaex use case).
import pyarrow.parquet
import pyarrow as pa
import pyarrow.dataset as ds
import glob

ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
scanned = 0
for scan_task in ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=True):
    for record_batch in scan_task.execute():
        scanned += record_batch.num_rows
scanned
This only seems to use 1 CPU.
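As a cross-check, the higher-level Dataset.to_table path should drive the same scanner; a minimal sketch, assuming the installed pyarrow version exposes to_table with a use_threads flag:

import glob
import pyarrow.dataset as pads

dataset = pads.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
# materialize only the single column; use_threads=True is expected to engage all cores
table = dataset.to_table(columns=['passenger_count'], use_threads=True)
print(table.num_rows)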
Using a threadpool from Python:
# %%timeit
import concurrent.futures

pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))

def process(scan_task):
    scan_count = 0
    for record_batch in scan_task.execute():
        scan_count += len(record_batch)
    return scan_count

sum(pool.map(process, ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=False)))
This gives me similar performance; again, only 100% CPU usage (i.e. a single core).
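For comparison, here is a sketch of the same count done with per-file reads through pyarrow.parquet directly; since the Parquet reader releases the GIL, this path would be expected to scale across threads (paths and column name taken from the setup above):

import concurrent.futures
import glob
import pyarrow.parquet as pq

files = glob.glob('/data/taxi_parquet/data_*.parquet')

def count_rows(path):
    # read a single column from one file; the C++ Parquet reader releases the GIL
    return pq.read_table(path, columns=['passenger_count']).num_rows

with concurrent.futures.ThreadPoolExecutor() as pool:
    print(sum(pool.map(count_rows, files)))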
py-spy (a sampling profiler for Python) shows no GIL activity, so this might be something at the C++ layer.
Am I 'holding it wrong', or could this be a bug? Note that IO speed is not a problem on this system (all the data comes from the OS cache; no disk reads were observed).
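For anyone reproducing without the taxi dataset, a minimal sketch that generates similarly shaped files; the schema, value ranges, and file count here are placeholders, not the original data:

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# write a few dummy files shaped like the report's data
# (1_000_000 rows per file; the real dataset has 1000 files and ~20 columns)
for i in range(10):
    n = 1_000_000
    table = pa.table({
        'passenger_count': np.random.randint(0, 7, size=n),
        'trip_distance': np.random.rand(n),
    })
    pq.write_table(table, f'/data/taxi_parquet/data_{i}.parquet')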