Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-15519

[C++] Investigate potential performance improvements for the filter node

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • C++
    • None

    Description

      Right now some early runs with Arrowbench and the OT PR (https://github.com/apache/arrow/pull/12100) shows that we spend a fair amount of time in TPC-H queries on filter nodes. There are a few improvements we know could be made to our filtering approach at the moment. I'm creating this parent issue to help categorize and track those:

      • We can use a selection vector in our filters to reduce the amount of materialization needed. While long term we may want to support a selection vector throughout the exec plan a good start would be to use it when we encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x > 5 && y < 20)
      • If a filter if very selective then we may end up outputting a lot of very small batches. We could probably hold onto the data at the filter node until we've accumulated enough rows for a decent sized batch.
      • The filter node is currently creating new thread tasks instead of appending its work onto an existing thread task.
      • If we have a chain of filters we could potentially use runtime selectivity statistics / estimates to reorder our filters so that the most selective filters are evaluated first.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              westonpace Weston Pace
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: