Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
Right now some early runs with Arrowbench and the OT PR (https://github.com/apache/arrow/pull/12100) shows that we spend a fair amount of time in TPC-H queries on filter nodes. There are a few improvements we know could be made to our filtering approach at the moment. I'm creating this parent issue to help categorize and track those:
We can use a selection vector in our filters to reduce the amount of materialization needed. While long term we may want to support a selection vector throughout the exec plan a good start would be to use it when we encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x > 5 && y < 20)- If a filter if very selective then we may end up outputting a lot of very small batches. We could probably hold onto the data at the filter node until we've accumulated enough rows for a decent sized batch.
- The filter node is currently creating new thread tasks instead of appending its work onto an existing thread task.
- If we have a chain of filters we could potentially use runtime selectivity statistics / estimates to reorder our filters so that the most selective filters are evaluated first.
Attachments
Issue Links
- is related to
-
ARROW-14970 [C++][Compute] Replace ExecNode::InputReceived with ::MakeTask (Part 2)
- In Progress
1.
|
[C++] Investigate batching filter node output | Open | Unassigned | |
2.
|
[C++] Investigate reporting filter selectivity for filter order optimization | Open | Unassigned |