Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-6072

Enable hash joins for null-safe equality predicates

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.2.1
    • None
    • SQL
    • None

    Description

      Currently joins such as
      A join B on A.x = B.x AND A.y <=> B.y
      are evaluated as hash join on just x followed by filter on y.
      This causes a skew problem (very long join) when a particular value of x has a high cardinality even though (x, y) is evenly distributed

      Can we implement is as a hash join on (X, Option(Y))? This will eliminate the skew in this case

      Imagine a join:
      People as p1 join People as p2 on p1.name = p2.name and p1.address <=> p2.address

      (very small percentage of people has unknown address)

      This causes a skewed join on popular names such as "Mary Brown" if we hash on names alone, but will not cause a skew if we hash on (Name, Option(Address))

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              dimazhiyanov Dima Zhiyanov
              Votes:
              2 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: