Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.5.1
Description
Sedona fails to optimize spatial joins with complex join conditions. Here is an example:
df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)")) df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, id), 2)")) df1.createOrReplaceTempView("df1") df2.createOrReplaceTempView("df2") spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id AND df1.id < df2.id + 10").explain()
This join has a spatial condition ST_Intersects(df1.geom, df2.goem), but it is planned as a broadcast nested loop join:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastNestedLoopJoin BuildRight, Inner, (( **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects** AND (id#434L > id#439L)) AND (id#434L < (id#439L + 10)))
:- Project [id#434L, **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#436]
: +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point** )
: +- Range (1, 1000, step=1, splits=10)
+- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
+- Project [id#439L, **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#441]
+- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** )
+- Range (1, 1000, step=1, splits=10)
If we specify a simpler join condition, Sedona will optimize the join correctly:
spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id").explain()
Output:
== Physical Plan == BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500) :- SpatialIndex geom#495: geometry, RTREE, false, false : +- Project [id#493L, **org.apache.spark.sql.sedona_sql.expressions.ST_Point** AS geom#495] : +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point** ) : +- *(1) Range (1, 1000, step=1, splits=10) +- Project [id#498L, **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** AS geom#500] +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer** ) +- *(2) Range (1, 1000, step=1, splits=10)
Attachments
Issue Links
- links to