Apache Sedona / SEDONA-22

Failing on join if a geodataframe is empty


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1.0

    Description

      When I try to do an ST_Intersects join between a DataFrame that has a column of points and a DataFrame that holds a single polygon, Spark fails when the first DataFrame has no rows in it.

      Is there a way to mitigate the error and return an empty DataFrame instead?

      I really don't want to have to persist and then count the DataFrame just to avoid the crash.
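
      That guard would look something like the sketch below (illustrative only; dfg stands for the point DataFrame behind the dfg view in the failing query):

      # Sketch of the persist-and-count guard this issue asks to make unnecessary.
      dfg.persist()
      if dfg.count() > 0:   # forces a full materialization just to test for rows
          df_filtered = spark.sql(
              'SELECT dfg.* FROM dfg, poly AS p WHERE ST_Intersects(dfg.geom, p.geometry)')
      else:
          df_filtered = dfg.limit(0)   # empty result with the same schema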


      The line that fails:

      df_filtered = spark.sql('SELECT dfg.* FROM dfg, poly AS p WHERE ST_Intersects(dfg.geom, p.geometry)')

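      For context, a minimal end-to-end repro sketch (assumes Sedona 1.0.0 and an existing SparkSession named spark; the data and view names are made up to mirror the query above):

      from sedona.register import SedonaRegistrator

      SedonaRegistrator.registerAll(spark)

      # Zero-row DataFrame of points: the empty side is what triggers the crash.
      spark.sql(
          "SELECT ST_Point(CAST(0.0 AS DECIMAL(24,20)), CAST(0.0 AS DECIMAL(24,20))) AS geom"
      ).filter("false").createOrReplaceTempView("dfg")

      # A single polygon to join against.
      spark.sql(
          "SELECT ST_GeomFromWKT('POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))') AS geometry"
      ).createOrReplaceTempView("poly")

      spark.sql(
          "SELECT dfg.* FROM dfg, poly AS p WHERE ST_Intersects(dfg.geom, p.geometry)"
      ).show()   # fails with the error below
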
      The error I get:

      Py4JJavaError: An error occurred while calling o2701.showString.
      : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
      Exchange hashpartitioning(geocoding_country#170754, 200), ENSURE_REQUIREMENTS, [id=#6604]
      +- *(3) HashAggregate(keys=[geocoding_country#170754], functions=[], output=[geocoding_country#170754])
         +- *(3) Project [geocoding_country#170754]
            +- RangeJoin geom#171688: geometry, geometry#74770: geometry, true
               :- Project [geocoding_country#170754, st_point(cast(longitude#170757 as decimal(24,20)), cast(latitude#170756 as decimal(24,20))) AS geom#171688]
               :  +- *(1) Filter (((((((isnotnull(longitude#170757) AND isnotnull(latitude#170756)) AND isnotnull(geocoding_country#170754)) AND (longitude#170757 >= 2.521799927545686)) AND (longitude#170757 <= 6.374525187000074)) AND (latitude#170756 >= 49.49522288100006)) AND (latitude#170756 <= 51.49623769100005)) AND (geocoding_country#170754 = bobostan))
               :     +- *(1) ColumnarToRow
               :        +- FileScan parquet [geocoding_country#170754,latitude#170756,longitude#170757] Batched: true, DataFilters: [isnotnull(longitude#170757), isnotnull(latitude#170756), isnotnull(geocoding_country#170754), (l..., Format: Parquet, Location: InMemoryFileIndex[file:/Users/itamar/Downloads/part-00024-eb04d380-134a-402d-90c2-81380003f43e.c000], PartitionFilters: [], PushedFilters: [IsNotNull(longitude), IsNotNull(latitude), IsNotNull(geocoding_country), GreaterThanOrEqual(long..., ReadSchema: struct<geocoding_country:string,latitude:double,longitude:double>
               +- Exchange RoundRobinPartitioning(1), REPARTITION_WITH_NUM, [id=#6598]
                  +- *(2) Scan ExistingRDD[geometry#74770]
      
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:163)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:525)
      	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:453)
      	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:452)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:496)
      	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:746)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
      	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:439)
      	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
      	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
      	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
      	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
      	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
      	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
      	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
      	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
      	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
      	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
      	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
      	at org.apache.spark.sql.Dataset.head(Dataset.scala:2722)
      	at org.apache.spark.sql.Dataset.take(Dataset.scala:2929)
      	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
      	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
      	at sun.reflect.GeneratedMethodAccessor173.invoke(Unknown Source)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:282)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:238)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.Exception: [AbstractSpatialRDD][spatialPartitioning] SpatialRDD boundary is null. Please call analyze() first.
      	at org.apache.sedona.core.spatialRDD.SpatialRDD.spatialPartitioning(SpatialRDD.java:220)
      	at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.doSpatialPartitioning(TraitJoinQueryExec.scala:185)
      	at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.doSpatialPartitioning$(TraitJoinQueryExec.scala:183)
      	at org.apache.spark.sql.sedona_sql.strategy.join.RangeJoinExec.doSpatialPartitioning(RangeJoinExec.scala:37)
      	at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.doExecute(TraitJoinQueryExec.scala:94)
      	at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.doExecute$(TraitJoinQueryExec.scala:56)
      	at org.apache.spark.sql.sedona_sql.strategy.join.RangeJoinExec.doExecute(RangeJoinExec.scala:37)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:525)
      	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:453)
      	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:452)
      	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:496)
      	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
      	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:746)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
      	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
      	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:118)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:118)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency$lzycompute(ShuffleExchangeExec.scala:151)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.shuffleDependency(ShuffleExchangeExec.scala:149)
      	at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:166)
      	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
      	... 44 more
      
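      The "Caused by" is the crux: the SpatialRDD built from the empty side of the join never gets a boundary envelope (analyze() has no geometries to aggregate), so spatialPartitioning() throws. On 1.0.0, a lighter guard than persist-and-count is to probe for a first row, since RDD.isEmpty() only scans partitions until it finds one. A sketch (the helper name is made up):

      # Hypothetical helper: skip the spatial join when the point side is empty.
      def intersects_join_or_empty(spark, points_view="dfg", poly_view="poly"):
          points = spark.table(points_view)
          if points.rdd.isEmpty():     # no persist()/count(); stops at the first row
              return points.limit(0)   # empty result, same schema as SELECT dfg.*
          return spark.sql(
              f"SELECT {points_view}.* FROM {points_view}, {poly_view} AS p "
              f"WHERE ST_Intersects({points_view}.geom, p.geometry)")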

            People

              Assignee: Unassigned
              Reporter: Itamar Landsman (itamarl)


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h