Details

Type: Improvement
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: Optimizer, SQL
Labels: None

Description

I encounter these cases frequently and have implemented the optimization manually (as shown here). If others run into this as well, it may be worth adding the corresponding tree transformations to Catalyst.

Case 1

A join like this:

left.join(
  right,
  arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in the logical plan
)

will produce the same results as:

{
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
    rightPrime,
    leftPrime("exploded_a") === rightPrime("exploded_b") // Equijoin doesn't produce a cartesian product
  ).drop("exploded_a", "exploded_b").distinct
}
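
To make the Case 1 equivalence easier to check without a Spark cluster, here is a minimal sketch that models it on plain Scala collections. The row types `L` and `R`, the sample data, and the object name are hypothetical and only stand in for the two DataFrames; `flatMap` over the array plays the role of `explode`, the map lookup plays the role of the equijoin, and `toSet` plays the role of `distinct`.

```scala
object OverlapJoinSketch {
  // Hypothetical row types standing in for the `left` and `right` DataFrames.
  final case class L(id: Int, a: Seq[Int])
  final case class R(id: Int, b: Seq[Int])

  val left: Seq[L]  = Seq(L(1, Seq(1, 2)), L(2, Seq(3)), L(3, Seq.empty))
  val right: Seq[R] = Seq(R(10, Seq(2, 1)), R(20, Seq(4)), R(30, Seq(3, 3)))

  // Original form: every (left, right) pair is tested, as in a filtered cartesian product.
  val overlapJoin: Set[(L, R)] =
    (for { l <- left; r <- right if l.a.intersect(r.b).nonEmpty } yield (l, r)).toSet

  // Rewritten form, step 1: "explode" the right side and index it by element.
  val rightByElem: Map[Int, Seq[R]] =
    right
      .flatMap(r => r.b.map(e => e -> r))
      .groupBy(_._1)
      .map { case (e, pairs) => e -> pairs.map(_._2) }

  // Step 2: "explode" the left side and look up matches by key (the equijoin).
  val explodedJoin: Seq[(L, R)] =
    for {
      l <- left
      e <- l.a
      r <- rightByElem.getOrElse(e, Seq.empty)
    } yield (l, r)

  // Step 3: deduplicate, since a pair appears once per matching element combination.
  val rewritten: Set[(L, R)] = explodedJoin.toSet
}
```

In this sample, `explodedJoin` contains each matching pair more than once (one copy per shared element occurrence), which is why the rewrite needs the final deduplication step.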

Case 2

A join like this:

left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical plan
)

will produce the same results as:

{
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
    right,
    leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct
}

Case 3

A join like this:

left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical plan
)

will produce the same results as:

{
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
    rightPrime,
    left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct
}
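
Cases 2 and 3 explode only the array side. The sketch below models that on plain Scala collections and also illustrates one caveat that seems worth keeping in mind: the trailing `distinct` collapses rows that were already duplicated in an input, so the two forms agree as sets of rows but not necessarily as multisets. The types `WithArr` and `WithValue`, the sample data, and the object name are hypothetical.

```scala
object ContainsJoinSketch {
  // Hypothetical row types: one side carries an array, the other a scalar value.
  final case class WithArr(id: Int, arr: Seq[String])
  final case class WithValue(id: Int, value: String)

  val arrSide: Seq[WithArr] =
    Seq(WithArr(1, Seq("x", "y", "x")), WithArr(2, Seq("z")), WithArr(3, Seq.empty))

  // Note the deliberately duplicated row WithValue(10, "x").
  val valueSide: Seq[WithValue] =
    Seq(WithValue(10, "x"), WithValue(10, "x"), WithValue(20, "z"))

  // Original form: array_contains evaluated for every pair of rows.
  val containsJoin: Seq[(WithArr, WithValue)] =
    for { a <- arrSide; v <- valueSide if a.arr.contains(v.value) } yield (a, v)

  // Rewritten form: explode the array side, then join on equality via a keyed lookup.
  val byValue: Map[String, Seq[WithValue]] = valueSide.groupBy(_.value)

  val explodedJoin: Seq[(WithArr, WithValue)] =
    for {
      a <- arrSide
      e <- a.arr
      v <- byValue.getOrElse(e, Seq.empty)
    } yield (a, v)

  // Equivalent of the trailing `.distinct` in the DataFrame version.
  val rewritten: Seq[(WithArr, WithValue)] = explodedJoin.distinct
}
```

Here `containsJoin` keeps one output row per duplicated input row, while `rewritten` keeps a single copy, so a Catalyst rule based on this rewrite would presumably need to preserve row multiplicity in some other way, for example by deduplicating on a row identifier rather than on the full row.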

Issue Links

links to

GitHub Pull Request #24563

People

Assignee: Unassigned

Reporter: Nikolas Vanderhoof

Votes: 0

Watchers: 1

Dates

Created: 03/Apr/19 22:58

Updated: 16/Mar/20 22:55