Spark / SPARK-27359

Joins on some array functions can be optimized

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Optimizer, SQL
    • Labels: None

    Description

      I run into these cases frequently and have implemented the optimization manually, as shown below. If others see the same pattern, it may be worth adding the corresponding tree transformations to Catalyst.

      Case 1

      A join like this:

      import org.apache.spark.sql.functions.{arrays_overlap, col, explode}

      left.join(
        right,
        arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in the logical plan
      )
      

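      will produce the same results as the following, provided neither input contains duplicate rows (the final distinct would collapse them):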

      {
        val leftPrime = left.withColumn("exploded_a", explode(col("a")))
        val rightPrime = right.withColumn("exploded_b", explode(col("b")))
      
        leftPrime.join(
          rightPrime,
          leftPrime("exploded_a") === rightPrime("exploded_b")
            // Equijoin doesn't produce cartesian product
        ).drop("exploded_a", "exploded_b").distinct
      }
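
The Case 1 equivalence can be checked on a small example. The following is a minimal sketch, assuming Spark 2.4 or later (where arrays_overlap is available) and a local SparkSession. The object name, the column names left_id and right_id, and the sample rows are hypothetical and chosen only for illustration. It builds both forms of the join, prints their physical plans with explain(), and asserts that the two results hold the same rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{arrays_overlap, col, explode}

// Hypothetical check harness; not part of Spark itself.
object ArraysOverlapJoinCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("arrays-overlap-join-check")
      .getOrCreate()
    import spark.implicits._

    // Illustrative inputs with no duplicate rows on either side.
    val left = Seq((1, Seq("x", "y")), (2, Seq("z"))).toDF("left_id", "a")
    val right = Seq((10, Seq("y", "x")), (20, Seq("w"))).toDF("right_id", "b")

    // Original form: the join condition is not an equality.
    val direct = left.join(right, arrays_overlap(left("a"), right("b")))

    // Manual rewrite: explode both arrays, equijoin on the elements,
    // then drop the helper columns and deduplicate.
    val leftPrime = left.withColumn("exploded_a", explode(col("a")))
    val rightPrime = right.withColumn("exploded_b", explode(col("b")))
    val rewritten = leftPrime
      .join(rightPrime, leftPrime("exploded_a") === rightPrime("exploded_b"))
      .drop("exploded_a", "exploded_b")
      .distinct()

    // Print the physical plans so the chosen join strategies can be compared.
    direct.explain()
    rewritten.explain()

    // except() is set-based, so the row counts are compared as well.
    assert(direct.except(rewritten).isEmpty)
    assert(rewritten.except(direct).isEmpty)
    assert(direct.count() == rewritten.count())

    spark.stop()
  }
}
```

Comparing the two explain() outputs is the quickest way to see whether the rewrite changes the join strategy on a given Spark version and data size.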
      

      Case 2

      A join like this:

      import org.apache.spark.sql.functions.{array_contains, col, explode}

      left.join(
        right,
        array_contains(left("arr"), right("value")) // Cartesian product in logical plan
      )
      

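      will produce the same results as the following, provided neither input contains duplicate rows (the final distinct would collapse them):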

      {
        val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))
      
        leftPrime.join(
          right,
          leftPrime("exploded_arr") === right("value") // Fast equijoin
        ).drop("exploded_arr").distinct
      }
      

      Case 3

      A join like this:

      left.join(
        right,
        array_contains(right("arr"), left("value")) // Cartesian product in logical plan
      )
      

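      will produce the same results as the following, provided neither input contains duplicate rows (the final distinct would collapse them):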

      {
        val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))
      
        left.join(
          rightPrime,
          left("value") === rightPrime("exploded_arr") // Fast equijoin
        ).drop("exploded_arr").distinct
      }
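
All three rewrites end in distinct, so they match the original join only when the inputs hold no duplicate rows. The sketch below illustrates this with the Case 2 form, again assuming Spark 2.4 or later and a local SparkSession. It also shows one possible workaround that is not part of the original proposal: tag every input row with monotonically_increasing_id before exploding, so that distinct only collapses the copies produced by explode. The object name, the helper columns left_row and right_row, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col, explode, monotonically_increasing_id}

// Hypothetical demonstration harness; not part of Spark itself.
object ArrayContainsJoinDuplicates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-contains-join-duplicates")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: the left side holds the same row twice.
    val left = Seq((1, Seq("x", "y")), (1, Seq("x", "y"))).toDF("id", "arr")
    val right = Seq("x").toDF("value")

    // Original form keeps both duplicate left rows.
    val direct = left.join(right, array_contains(left("arr"), right("value")))

    // Manual rewrite from Case 2: distinct merges the two identical rows.
    val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))
    val rewritten = leftPrime
      .join(right, leftPrime("exploded_arr") === right("value"))
      .drop("exploded_arr")
      .distinct()

    assert(direct.count() == 2L)
    assert(rewritten.count() == 1L)

    // Assumed workaround: give each input row a unique tag before exploding,
    // deduplicate with the tags present, then drop the tags.
    val leftTagged = left
      .withColumn("left_row", monotonically_increasing_id())
      .withColumn("exploded_arr", explode(col("arr")))
    val rightTagged = right.withColumn("right_row", monotonically_increasing_id())
    val tagged = leftTagged
      .join(rightTagged, leftTagged("exploded_arr") === rightTagged("value"))
      .drop("exploded_arr")
      .distinct()
      .drop("left_row", "right_row")

    assert(tagged.count() == direct.count())

    spark.stop()
  }
}
```

A Catalyst rule implementing these transformations would need some equivalent of the row tags, or another way to preserve multiplicity, to stay semantically identical to the original join.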
      

People

    • Assignee: Unassigned
    • Reporter: Nikolas Vanderhoof (nikvanderhoof)