Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32461

Shuffled hash join improvement

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.1.0, 3.2.0
    • None
    • SQL

    Description

      Shuffled hash join avoids sort compared to sort merge join. This advantage shows up obviously when joining large table in terms of saving CPU and IO (in case of external sort happens). In latest master trunk, shuffled hash join is disabled by default with config "spark.sql.join.preferSortMergeJoin"=true, with favor of reducing risk of OOM. However shuffled hash join could be improved to a better stateĀ (validated in our internal fork). Creating this Jira to track overall progress.

      Attachments

        1.
        Coalesce bucketed tables for shuffled hash join if applicable Sub-task Resolved Cheng Su
        2.
        Preserve shuffled hash join build side partitioning Sub-task Resolved Cheng Su
        3.
        Preserve hash join (BHJ and SHJ) stream side ordering Sub-task Resolved Cheng Su
        4.
        Add handling for unique key in non-codegen hash join Sub-task Resolved Cheng Su
        5.
        Add code-gen for shuffled hash join Sub-task Resolved Cheng Su
        6.
        Support full outer join in shuffled hash join Sub-task Resolved Cheng Su
        7.
        Code-gen for full outer shuffled hash join Sub-task Resolved Cheng Su
        8.
        Fix the config value for shuffled hash join in test in-joins.sql Sub-task Resolved Cheng Su
        9.
        Record metrics of extra BitSet/HashSet in full outer shuffled hash join Sub-task Resolved Cheng Su
        10.
        Optimize BHJ/SHJ inner and semi join with empty hashed relation Sub-task Resolved Cheng Su
        11.
        Exercise code-gen enable/disable code paths for SHJ in join test suites Sub-task Resolved Cheng Su
        12.
        Ignore duplicated join keys when building relation for SEMI/ANTI shuffle hash join Sub-task Resolved Cheng Su
        13.
        A dynamic join operator to improve the join reliability Sub-task Resolved Unassigned
        14.
        Add hash probes metrics for shuffled hash join Sub-task In Progress Unassigned
        15.
        Introduce sort-based fallback mechanism for shuffled hash join Sub-task In Progress Unassigned
        16.
        Only codegen build side separately for shuffled hash join Sub-task Open Unassigned
        17.
        Introduce hybrid join for sort merge join and shuffled hash join in AQE Sub-task Open Unassigned
        18.
        Support left outer join build left or right outer join build right in shuffled hash join Sub-task Resolved Szehon Ho
        19.
        Code-gen for build side outer shuffled hash join Sub-task Resolved Szehon Ho

        Activity

          People

            Unassigned Unassigned
            chengsu Cheng Su
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

            Dates

              Created:
              Updated: