CALCITE-5559: Improve RepeatUnion by discarding duplicates at TableSpool level

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core

    Description

      Currently, the RepeatUnion operator with all=false keeps track of the elements it has already returned in order to discard duplicates. However, the TableSpool operators directly below it have no such control. In certain scenarios, the current iteration produces duplicates that RepeatUnion discards, but by then the TableSpool has already "fed them back" into the next iteration, causing unnecessary processing.
      We can optimize this scenario by also keeping track of duplicates inside (or just before) the TableSpool. Note that we still need to track duplicates at the RepeatUnion level, because that is the only place where we can detect a potential "global duplicate" of an element: one returned by the LHS and then also by the RHS, or by two different iterations of the RHS.
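
      The effect can be illustrated with a toy model. The sketch below is not Calcite code: the class SpoolDedupSketch, the method reachable, the counter rowsFedBack and the flag dedupAtSpool are all hypothetical names, and the spool-level "seen" set is only one assumed way of tracking duplicates. It evaluates a recursive "reachable nodes" query in the style of RepeatUnion(all=false) and counts how many rows are handed to later iterations, with and without discarding duplicates before they are fed back.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; it does not use or mirror Calcite's actual classes. */
public class SpoolDedupSketch {
    /** Rows written into the iterative input, i.e. work handed to later iterations. */
    static int rowsFedBack;

    /**
     * Computes the nodes reachable from {@code seed}.
     * The seed plays the role of the LHS, one expansion step plays the RHS.
     */
    static Set<Integer> reachable(
            Map<Integer, List<Integer>> edges, int seed, boolean dedupAtSpool) {
        rowsFedBack = 0;
        // Union-level tracking: always needed to catch "global" duplicates.
        Set<Integer> emitted = new LinkedHashSet<>();
        // Assumed spool-level tracking: rows the iterative spool has already accepted.
        Set<Integer> spoolSeen = new HashSet<>();

        emitted.add(seed);
        List<Integer> delta = new ArrayList<>();
        delta.add(seed);

        boolean producedNewRow = true;
        // Stop after the first iteration that contributes no new row to the result.
        while (producedNewRow) {
            producedNewRow = false;
            List<Integer> nextDelta = new ArrayList<>();
            for (int node : delta) {
                for (int target : edges.getOrDefault(node, List.of())) {
                    if (dedupAtSpool && !spoolSeen.add(target)) {
                        continue; // dropped before it is fed back or reaches the union
                    }
                    nextDelta.add(target); // fed back into the next iteration
                    rowsFedBack++;
                    if (emitted.add(target)) { // the union still discards duplicates
                        producedNewRow = true;
                    }
                }
            }
            delta = nextDelta;
        }
        return emitted;
    }
}
```

      On a small graph made of two chained "diamonds" with an edge back to the seed (1→2, 1→3, 2→4, 3→4, 4→5, 4→6, 5→7, 6→7, 7→1), both variants return the same seven nodes, but this model feeds back 16 rows without spool-level deduplication and 7 rows with it. In the deduplicating run, node 1 still passes the spool once, because the spool never saw the seed, and is only discarded by the union-level set, which is why tracking at both levels remains necessary.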

      A PoC testing this improvement on a downstream project showed that certain queries can go from ~40s down to ~1s.

            People

              Assignee: Ruben Q L (rubenql)
              Reporter: Ruben Q L (rubenql)
                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m