SPARK-17570: Avoid Hash and Exchange in Sort Merge Join if the bucketing factor is a multiple across tables

Parent: SPARK-19256 Hive bucketing write support


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      In the case of bucketed tables, Spark avoids the `Sort` and `Exchange` steps if the input tables and the output table have the same number of buckets. Unequal bucket counts, however, always lead to a `Sort` and `Exchange`. If the number of buckets in the output table is a factor of the number of buckets in the input tables, we should be able to avoid the `Sort` and `Exchange` and join the matching buckets directly (as sketched below).
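
      For reference, a minimal sketch of the case Spark already optimizes, with both sides bucketed and sorted into the same number of buckets on the join key (table names, data and join key here are hypothetical):

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("bucketed-smj-sketch").getOrCreate()

      // Two tables bucketed + sorted into the SAME number of buckets (8) on the join key.
      spark.range(1000000).selectExpr("id % 100 AS key", "id AS v1")
        .write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("input1")
      spark.range(1000000).selectExpr("id % 100 AS key", "id AS v2")
        .write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("input2")

      // With matching bucket counts the plan is a SortMergeJoin with no Exchange on
      // either side (and, when each bucket holds one sorted file, no Sort either).
      spark.table("input1").join(spark.table("input2"), "key").explain()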

      E.g. assume Input1, Input2 and Output are bucketed + sorted tables over the same columns but with different numbers of buckets: Input1 has 8 buckets, Input2 has 12 buckets and Output has 4 buckets. Since hash-partitioning is done using a modulus, joining buckets (0, 4) of Input1 with buckets (0, 4, 8) of Input2 in the same task yields bucket 0 of the output table:

      Input1   (0, 4)      (1, 5)      (2, 6)       (3, 7)
      Input2   (0, 4, 8)   (1, 5, 9)   (2, 6, 10)   (3, 7, 11)
      Output   (0)         (1)         (2)          (3)
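
      The factor relationship is what makes this safe: bucket ids are assigned by hashing the key and taking a modulus, and because 4 divides both 8 and 12, a row in input bucket b can only belong to output bucket b mod 4. A small sketch that reproduces the grouping above (Input2 is assumed to have 12 buckets, as its bucket triples imply):

      // Which input buckets co-group into each output bucket under modulus bucketing.
      val input1Buckets = 8
      val input2Buckets = 12
      val outputBuckets = 4

      for (out <- 0 until outputBuckets) {
        val g1 = (0 until input1Buckets).filter(_ % outputBuckets == out).mkString("(", ", ", ")")
        val g2 = (0 until input2Buckets).filter(_ % outputBuckets == out).mkString("(", ", ", ")")
        println(s"output bucket $out  <-  Input1 $g1  +  Input2 $g2")
      }
      // output bucket 0  <-  Input1 (0, 4)  +  Input2 (0, 4, 8)
      // output bucket 1  <-  Input1 (1, 5)  +  Input2 (1, 5, 9)
      // ...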
      


People

    • Assignee: Unassigned
    • Reporter: Tejas Patil (tejasp)
    • Votes: 6
    • Watchers: 19
