[ARROW-15519] [C++] Investigate potential performance improvements for the filter node - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/30992

Description

Early runs with Arrowbench and the OT PR (https://github.com/apache/arrow/pull/12100) show that TPC-H queries spend a fair amount of time in filter nodes. There are a few known improvements we could make to our current filtering approach. I'm creating this parent issue to help categorize and track them:

We can use a selection vector in our filters to reduce the amount of materialization needed. Long term we may want to support a selection vector throughout the exec plan, but a good start would be to use it when we encounter a chain of filters (e.g. x < 10 && x > 5 && y < 20) so that intermediate results are not materialized.
If a filter is very selective, we may end up outputting a lot of very small batches. We could probably hold onto the data at the filter node until we've accumulated enough rows for a decent-sized batch.
The filter node currently creates new thread tasks instead of appending its work onto an existing thread task.
If we have a chain of filters, we could potentially use runtime selectivity statistics or estimates to reorder the filters so that the most selective ones are evaluated first.
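
To illustrate the selection vector idea for a chain of filters, here is a minimal standalone sketch. It does not use the Arrow API: `Batch`, `RowPredicate`, `ApplyFilters`, and `Materialize` are hypothetical names invented for this example, and the row-at-a-time predicate is a simplification of how a vectorized kernel would actually evaluate an expression. The point is only that each filter narrows a list of row indices, and column data is copied once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical batch with two int32 columns of equal length.
struct Batch {
  std::vector<int32_t> x;
  std::vector<int32_t> y;
};

// Hypothetical predicate evaluated on one row of a batch.
using RowPredicate = std::function<bool(const Batch&, int32_t)>;

// Run a chain of filters by narrowing a selection vector of row indices.
// No column data is copied here; only the index list shrinks.
std::vector<int32_t> ApplyFilters(const Batch& batch,
                                  const std::vector<RowPredicate>& filters) {
  std::vector<int32_t> selection(batch.x.size());
  std::iota(selection.begin(), selection.end(), 0);  // start with all rows
  for (const auto& pred : filters) {
    std::size_t kept = 0;
    for (int32_t row : selection) {
      // Compact surviving indices in place; kept never passes the read cursor.
      if (pred(batch, row)) selection[kept++] = row;
    }
    selection.resize(kept);
  }
  return selection;
}

// Copy the selected rows once, after the whole chain has been evaluated.
Batch Materialize(const Batch& batch, const std::vector<int32_t>& selection) {
  Batch out;
  out.x.reserve(selection.size());
  out.y.reserve(selection.size());
  for (int32_t row : selection) {
    assert(row >= 0 && static_cast<std::size_t>(row) < batch.x.size());
    out.x.push_back(batch.x[row]);
    out.y.push_back(batch.y[row]);
  }
  return out;
}
```

With the three predicates from the example above (x < 10, x > 5, y < 20), the columns are gathered once instead of three times. Whether this wins in practice depends on selectivity and on how well the kernels handle indexed access, which is part of what this issue proposes to investigate.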

Issue Links

is related to: ARROW-14970 [C++][Compute] Replace ExecNode::InputReceived with ::MakeTask (Part 2) (In Progress)

Sub-Tasks

1.	[C++] Investigate batching filter node output		Open	Unassigned
2.	[C++] Investigate reporting filter selectivity for filter order optimization		Open	Unassigned
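
As a rough illustration of what the two sub-tasks might involve, the following standalone sketch shows an output accumulator and a per-filter selectivity counter. None of this is Arrow code: `OutputAccumulator`, `FilterStats`, and `OrderBySelectivity` are hypothetical names, rows are modelled as plain `int32_t` values, and the ordering ignores the evaluation cost of each filter, which a real optimizer would likely need to weigh.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical accumulator: buffers filtered rows until a target batch size
// is reached, so downstream nodes do not receive many tiny batches.
class OutputAccumulator {
 public:
  explicit OutputAccumulator(std::size_t target_rows)
      : target_rows_(target_rows) {
    assert(target_rows > 0);
  }

  // Append filtered rows; returns every full batch that became ready.
  std::vector<std::vector<int32_t>> Push(const std::vector<int32_t>& rows) {
    std::vector<std::vector<int32_t>> ready;
    for (int32_t value : rows) {
      pending_.push_back(value);
      if (pending_.size() == target_rows_) {
        ready.push_back(std::move(pending_));
        pending_.clear();  // reset the moved-from buffer to a known state
      }
    }
    return ready;
  }

  // Emit whatever remains once the input is exhausted.
  std::vector<int32_t> Flush() { return std::exchange(pending_, {}); }

 private:
  std::size_t target_rows_;
  std::vector<int32_t> pending_;
};

// Hypothetical runtime counters for one filter in a chain.
struct FilterStats {
  explicit FilterStats(int filter_id) : id(filter_id) {}

  void Record(uint64_t in, uint64_t out) {
    rows_in += in;
    rows_out += out;
  }

  // Fraction of rows that pass; 1.0 until any rows have been observed.
  double PassRate() const {
    return rows_in == 0 ? 1.0 : static_cast<double>(rows_out) / rows_in;
  }

  int id;
  uint64_t rows_in = 0;
  uint64_t rows_out = 0;
};

// Put the most selective filters (lowest pass rate) first.
void OrderBySelectivity(std::vector<FilterStats>& stats) {
  std::stable_sort(stats.begin(), stats.end(),
                   [](const FilterStats& a, const FilterStats& b) {
                     return a.PassRate() < b.PassRate();
                   });
}
```

The accumulator trades a little latency for larger batches, and `Flush` must be called at end of input so the final partial batch is not lost.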

People

Assignee: Unassigned

Reporter: Weston Pace

Votes: 0

Watchers: 4

Dates

Created: 01/Feb/22 21:02

Updated: 11/Jan/23 11:37