Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
The following query throws a NullPointerException.
select /*+ BROADCAST(t2) */ * from t1 left join t2 on st_intersects(t1.geom, t2.geom)
java.lang.NullPointerException
    at org.locationtech.jts.io.WKBReader.read(WKBReader.java:159)
    at org.apache.sedona.sql.utils.GeometrySerializer$.deserialize(GeometrySerializer.scala:50)
    at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryBase.$anonfun$toSpatialRDD$1(TraitJoinQueryBase.scala:45)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
The failure happens when the streaming (non-broadcast) side is mapped to a SpatialRDD and a row's geometry is null. The NPE doesn't happen for inner joins with null geometries. I suspect Spark pushes down an IS NOT NULL predicate in that case, since rows with null geometries would be excluded from an inner join anyway.
Looking at the code, I suspect there are more errors in the new broadcast join types. The InternalRow is encoded in the user data field of the geometry, which doesn't work if the geometry is null. For a left join, the InternalRow on the left side has to be emitted even if its geometry is null. Instead of using a SpatialRDD, it might be better to map the RDD[InternalRow] to an RDD[Pair[Geometry, InternalRow]] where the Geometry might be null.
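A minimal sketch of the pairing idea, using plain Scala collections in place of RDDs so that it runs without Spark or JTS. Everything in it is a hypothetical stand-in, not Sedona code: Box plays the role of Geometry, Row the role of InternalRow, deserialize the role of GeometrySerializer.deserialize, and a linear scan replaces the broadcast spatial index. The point is only that the row travels next to an optional geometry rather than inside the geometry's user data, so a left-side row with a null geometry is still emitted.

```scala
object NullSafeLeftJoinSketch {
  // Hypothetical stand-in for a JTS Geometry: an axis-aligned box.
  final case class Box(minX: Double, minY: Double, maxX: Double, maxY: Double) {
    def intersects(o: Box): Boolean =
      minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY
  }

  // Hypothetical stand-in for InternalRow; geom is "minX,minY,maxX,maxY" or null.
  final case class Row(id: Int, geom: String)

  // Like the real deserializer, this throws a NullPointerException on null input.
  def deserialize(s: String): Box = {
    val p = s.split(",").map(_.toDouble)
    Box(p(0), p(1), p(2), p(3))
  }

  // Pair every row with its geometry. A null geometry becomes None, and the
  // row is kept alongside the geometry instead of inside it.
  def toPairs(rows: Iterator[Row]): Iterator[(Option[Box], Row)] =
    rows.map(r => (Option(r.geom).map(deserialize), r))

  // Left join: every left row is emitted at least once, even without a geometry.
  def leftJoin(left: Iterator[Row], broadcast: Seq[Row]): Iterator[(Row, Option[Row])] = {
    // Broadcast rows with a null geometry can never match, so they are dropped here.
    val index = toPairs(broadcast.iterator).collect { case (Some(g), r) => (g, r) }.toVector
    toPairs(left).flatMap {
      case (Some(g), row) =>
        val matches: Vector[Option[Row]] =
          index.collect { case (bg, br) if g.intersects(bg) => Option(br) }
        (if (matches.isEmpty) Vector(Option.empty[Row]) else matches).iterator.map(m => (row, m))
      case (None, row) =>
        Iterator((row, Option.empty[Row]))
    }
  }

  def main(args: Array[String]): Unit = {
    val t1 = Seq(Row(1, "0,0,2,2"), Row(2, null), Row(3, "10,10,11,11"))
    val t2 = Seq(Row(100, "1,1,3,3"), Row(101, null))
    leftJoin(t1.iterator, t2).foreach { case (l, r) => println(s"${l.id} -> ${r.map(_.id)}") }
    // prints:
    // 1 -> Some(100)
    // 2 -> None
    // 3 -> None
  }
}
```

Row 2 has a null geometry and still appears in the output with no match, which is what a left join requires. Whether the real fix should use Option, a nullable Geometry in a Pair as suggested above, or something else is left open here.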