Uploaded image for project: 'Apache Sedona'
  1. Apache Sedona
  2. SEDONA-532

Sedona Spark SQL optimizer cannot optimize joins with complex conditions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.5.1
    • 1.6.0

    Description

      Sedona fails to optimize spatial joins with complex join conditions. Here is an example:

      df1 = spark.range(1, 1000).withColumn("geom", expr("ST_Point(id, id)"))
      df2 = spark.range(1, 1000).withColumn("geom", expr("ST_Buffer(ST_Point(id, id), 2)"))
      df1.createOrReplaceTempView("df1")
      df2.createOrReplaceTempView("df2")
      spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id AND df1.id < df2.id + 10").explain()
      

      This join has a spatial condition ST_Intersects(df1.geom, df2.goem), but it is planned as a broadcast nested loop join:

      == Physical Plan ==
      AdaptiveSparkPlan isFinalPlan=false
      +- BroadcastNestedLoopJoin BuildRight, Inner, (( **org.apache.spark.sql.sedona_sql.expressions.ST_Intersects**   AND (id#434L > id#439L)) AND (id#434L < (id#439L + 10)))
         :- Project [id#434L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#436]
         :  +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
         :     +- Range (1, 1000, step=1, splits=10)
         +- BroadcastExchange IdentityBroadcastMode, [plan_id=1157]
            +- Project [id#439L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#441]
               +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
                  +- Range (1, 1000, step=1, splits=10)
      

      If we specify a simpler join condition, Sedona will optimize the join correctly:

      spark.sql("SELECT * FROM df1 JOIN df2 ON ST_Intersects(df1.geom, df2.geom) AND df1.id > df2.id").explain()
      

      Output:

      == Physical Plan ==
      BroadcastIndexJoin geom#500: geometry, LeftSide, LeftSide, Inner, INTERSECTS, (id#493L > id#498L) ST_INTERSECTS(geom#495, geom#500)
      :- SpatialIndex geom#495: geometry, RTREE, false, false
      :  +- Project [id#493L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Point**   AS geom#495]
      :     +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Point**  )
      :        +- *(1) Range (1, 1000, step=1, splits=10)
      +- Project [id#498L,  **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**   AS geom#500]
         +- Filter isnotnull( **org.apache.spark.sql.sedona_sql.expressions.ST_Buffer**  )
            +- *(2) Range (1, 1000, step=1, splits=10)
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              kontinuation Kristin Cowalcijk
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m