Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32461

Shuffled hash join improvement

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0, 3.2.0
    • None
    • SQL

    Description

      Shuffled hash join avoids sort compared to sort merge join. This advantage shows up obviously when joining large table in terms of saving CPU and IO (in case of external sort happens). In latest master trunk, shuffled hash join is disabled by default with config "spark.sql.join.preferSortMergeJoin"=true, with favor of reducing risk of OOM. However shuffled hash join could be improved to a better stateĀ (validated in our internal fork). Creating this Jira to track overall progress.

      Attachments

        Activity

          People

            Unassigned Unassigned
            chengsu Cheng Su
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

            Dates

              Created:
              Updated: