Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32290

NotInSubquery SingleColumn Optimize

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Normally,
      A NotInSubquery will plan into BroadcastNestedLoopJoinExec, which is very very time consuming. For example, I've done TPCH benchmark lately, Query 16 almost took half of the entire TPCH 22Query execution Time. So i proposed that to do the following optimize.

      Inside BroadcastNestedLoopJoinExec, we can identify not in subquery with only single column in following pattern.

      case _@Or(
                  _@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference),
                  _@IsNull(
                    _@EqualTo(_: AttributeReference, _: AttributeReference)
                  )
                )
      

      if buildSide rows is small enough, we can change build side data into a HashMap.
      so the M*N calculation can be optimized into M*log(N)

      I've done a benchmark job in 1TB TPCH, before apply the optimize
      Query 16 take around 18 mins to finish, after apply the M*log(N) optimize, it takes only 30s to finish.

      But this optimize only works on single column not in subquery, so i am here to seek advise whether the community need this update or not. I will do the pull request first, if the community member thought it's hack, it's fine to just ignore this request.

        Attachments

          Activity

            People

            • Assignee:
              leanken Leanken.Lin
              Reporter:
              leanken Leanken.Lin
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 24h
                24h
                Remaining:
                Remaining Estimate - 24h
                24h
                Logged:
                Time Spent - Not Specified
                Not Specified