Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32461

Shuffled hash join improvement

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:

      Description

      Shuffled hash join avoids sort compared to sort merge join. This advantage shows up obviously when joining large table in terms of saving CPU and IO (in case of external sort happens). In latest master trunk, shuffled hash join is disabled by default with config "spark.sql.join.preferSortMergeJoin"=true, with favor of reducing risk of OOM. However shuffled hash join could be improved to a better stateĀ (validated in our internal fork). Creating this Jira to track overall progress.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              chengsu Cheng Su
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated: