  1. Spark
  2. SPARK-32461 Shuffled hash join improvement
  3. SPARK-32649

Optimize BHJ/SHJ inner and semi join with empty hashed relation

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: SQL
    • Labels: None

    Description

      The `EmptyHashedRelation` introduced in https://github.com/apache/spark/pull/29389 suggests a minor optimization for broadcast hash join (BHJ) and shuffled hash join (SHJ) when the build-side hashed relation is empty.

      If the build-side hashed relation is empty (i.e. the build side is empty):

      1. Inner join: no stream-side row can match, so we don't need to execute the stream side at all and can just return an empty iterator - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L152

      2. Semi join: likewise, we don't need to execute the stream side at all and can just return an empty iterator - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L227

      An empty build side is not common, but when it happens we can skip executing the stream side entirely, which saves query CPU and I/O.
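
The short-circuit described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's implementation: `build_hashed_relation`, `inner_join`, `semi_join`, and the `(key, payload)` row shape are all hypothetical names chosen for illustration, and a plain dict stands in for the hashed relation. The stream side is passed as a callable so that "not executing the stream side" corresponds to never calling it.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical row shape: (join key, payload). Not Spark's row type.
Row = Tuple[int, str]
Relation = Dict[int, List[Row]]


def build_hashed_relation(build_rows: Iterable[Row]) -> Relation:
    """Group build-side rows by join key (toy stand-in for a hashed relation)."""
    relation: Relation = {}
    for row in build_rows:
        relation.setdefault(row[0], []).append(row)
    return relation


def inner_join(stream_side: Callable[[], Iterable[Row]],
               relation: Relation) -> Iterator[Tuple[Row, Row]]:
    """Inner hash join; the stream side is only evaluated if the relation is non-empty."""
    if not relation:
        # Empty build side: nothing can match, so stream_side() is never called.
        return iter(())
    return ((s, b) for s in stream_side() for b in relation.get(s[0], ()))


def semi_join(stream_side: Callable[[], Iterable[Row]],
              relation: Relation) -> Iterator[Row]:
    """Left semi hash join with the same empty-relation short-circuit."""
    if not relation:
        return iter(())
    return (s for s in stream_side() if s[0] in relation)


# Demo: record whether the (assumed expensive) stream side was ever scanned.
scans: List[str] = []


def scan_stream() -> List[Row]:
    scans.append("scanned")
    return [(1, "a"), (2, "b"), (3, "c")]


empty_relation = build_hashed_relation([])
inner_result = list(inner_join(scan_stream, empty_relation))
semi_result = list(semi_join(scan_stream, empty_relation))
# Both results are empty and `scans` is still empty: the stream side never ran.
```

The key design point in this sketch is that the emptiness check happens before the stream side is touched, so the saving covers the whole upstream computation and not just the per-row probe.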

          People

            Assignee: Cheng Su (chengsu)
            Reporter: Cheng Su (chengsu)