Details

Improvement

Status: Closed

Major

Resolution: Done

None

None
Description
I made a smaller, simplified example:
arr = pa.array(np.random.randn(1_000_000)) # mask with only few True values mask1 = np.zeros(len(arr), dtype=bool) mask1[np.random.randint(len(arr), size=100)] = True mask1_pa = pa.array(mask1) # mask with larger proportion of True values mask2 = np.zeros(len(arr), dtype=bool) mask2[np.random.randint(len(arr), size=10_000)] = True mask2_pa = pa.array(mask2)
Doing timings of doing a Arrow Filter kernel vs using numpy to convert the mask into indices and then using a Take kernel:
# mask 1 In [3]: %timeit arr.filter(mask1_pa) 132 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) In [4]: %%timeit ...: indices = np.nonzero(mask1)[0] ...: arr.take(indices) 114 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # mask 2 In [8]: %timeit arr.filter(mask2_pa) 711 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) In [9]: %%timeit ...: indices = np.nonzero(mask2)[0] ...: arr.take(indices) 333 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So in the first case, both are quite similar in timing. But in the second case, the numpy+take version is faster.
I know this might depend on a lot on the actual proportion of True values and how they are positioned in the array (random vs concentrated) etc, so there is probably not a general rule of what should be faster.
But, it still seems a potential indication that things can be optimized in the Filter kernel.
Attachments
Issue Links
 is fixed by

ARROW10696 [C++] Investigate a bit run reader that would only return runs of set bits
 Resolved
 is related to

ARROW10474 [C++] Implement vector kernel that transforms a boolean mask into selection indices
 Open