Details
- Type: Bug
- Priority: Major
- Status: Resolved
- Resolution: Fixed
Description
From the mailing list:

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)
df = pd.DataFrame({'data{}'.format(i): data for i in range(100)})
df['key'] = np.random.randint(0, 100, size=num_rows)
rb = pa.record_batch(df)
t = pa.table(df)

I found that the performance of filtering a record batch is very similar to pandas:

In [22]: %timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Whereas the performance of filtering a table is absolutely abysmal (no idea what's going on here):

In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)