[SEDONA-532] Sedona Spark SQL optimizer cannot optimize joins with complex conditions - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.5.1
Fix Version/s: 1.6.0
Labels:
- pull-request-available

Description

Sedona fails to optimize spatial joins with complex join conditions. Here is an example:

df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)"))
df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, id), 2)"))
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id AND df1.id < df2.id + 10").explain()

This join has a spatial condition ST_Intersects(df1.geom, df2.goem), but it is planned as a broadcast nested loop join:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastNestedLoopJoin BuildRight, Inner, (( **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects**   AND (id#434L > id#439L)) AND (id#434L < (id#439L + 10)))
   :- Project [id#434L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#436]
   :  +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
   :     +- Range (1, 1000, step=1, splits=10)
   +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
      +- Project [id#439L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#441]
         +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
            +- Range (1, 1000, step=1, splits=10)

If we specify a simpler join condition, Sedona will optimize the join correctly:

spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id").explain()

Output:

== Physical Plan ==
BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500)
:- SpatialIndex geom#495: geometry, RTREE, false, false
:  +- Project [id#493L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#495]
:     +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
:        +- *(1) Range (1, 1000, step=1, splits=10)
+- Project [id#498L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#500]
   +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
      +- *(2) Range (1, 1000, step=1, splits=10)

Attachments

Issue Links

links to

GitHub Pull Request #1325

Activity

People

Assignee:: Unassigned

Reporter:: Kristin Cowalcijk

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 09/Apr/24 12:09

Updated:: 01/May/24 03:37

Resolved:: 01/May/24 03:37

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m