Apache Sedona / SEDONA-218

Flaky test caused by improper handling of null struct values in Adapter.toDf


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.0

    Description

      The test case "can convert JavaPairRDD to DataFrame with user-supplied schema" randomly fails when running on my laptop, and the failure can be reproduced consistently after switching the geometry serde to a faster alternative I'm currently working on:

      - can convert JavaPairRDD to DataFrame with user-supplied schema *** FAILED ***
        org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2290.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2290.0 (TID 3859) (kontinuation executor driver): scala.MatchError: StructType(StructField(structText,StringType,true),StructField(structInt,IntegerType,true),StructField(structBool,BooleanType,true)) (of class org.apache.spark.sql.types.StructType)
      	at org.apache.sedona.sql.utils.Adapter$.parseString(Adapter.scala:287)
      	at org.apache.sedona.sql.utils.Adapter$.$anonfun$castRowToSchema$1(Adapter.scala:271)
      	at scala.collection.immutable.List.map(List.scala:297)
      	at org.apache.sedona.sql.utils.Adapter$.castRowToSchema(Adapter.scala:268)
      	at org.apache.sedona.sql.utils.Adapter$.$anonfun$toDf$7(Adapter.scala:189)
      	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
      	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
      	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
      	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
      	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
      	at org.apache.spark.scheduler.Task.run(Task.scala:136)
      	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
      	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:750)
      

      This test fails only when the first row processed by Adapter.toDf contains a null struct value. The test is flaky because not all rows have null struct values and the order of rows in the DataFrame is non-deterministic. We can fix the handling of null struct values and make the test verify all rows rather than only the first one.
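
      The MatchError in the stack trace comes from type-dispatch code hitting the StructType case while the value itself is null. The following self-contained sketch models the fix; it uses a toy DataType ADT and represents a row as Seq[Any] instead of Spark's Row, so the names are illustrative only, not Sedona's actual implementation:

```scala
// Minimal sketch of the bug and fix. A hypothetical recursive cast that
// matches on the declared data type must check for null BEFORE matching,
// otherwise a null struct value falls through to the StructType case and
// throws a scala.MatchError, as seen in the stack trace above.
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case object BooleanType extends DataType
case class StructType(fieldTypes: Seq[DataType]) extends DataType

object NullSafeCast {
  def castValue(value: Any, dataType: DataType): Any = {
    if (value == null) return null // the fix: short-circuit on null first
    dataType match {
      case StructType(fieldTypes) =>
        // Recurse into nested struct fields, pairing values with types.
        val fields = value.asInstanceOf[Seq[Any]]
        fields.zip(fieldTypes).map { case (v, t) => castValue(v, t) }
      case _ => value // primitives pass through unchanged in this sketch
    }
  }
}
```

      With the null check in place, a row whose struct column is null is passed through as null instead of aborting the stage; without it, whichever task happens to see the null-struct row first fails, which is exactly why the test only fails for some row orderings.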


              People

                Assignee: Unassigned
                Reporter: Kristin Cowalcijk (kontinuation)
                Votes: 0
                Watchers: 1


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m