[SEDONA-218] Flaky test caused by improper handling of null struct values in Adapter.toDf - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Labels:
- pull-request-available

Description

The test case "can convert JavaPairRDD to DataFrame with user-supplied schema" fails intermittently when run on my laptop. The failure becomes consistently reproducible after switching the geometry serde to a faster alternative I am currently working on:

- can convert JavaPairRDD to DataFrame with user-supplied schema *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2290.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2290.0 (TID 3859) (kontinuation executor driver): scala.MatchError: StructType(StructField(structText,StringType,true),StructField(structInt,IntegerType,true),StructField(structBool,BooleanType,true)) (of class org.apache.spark.sql.types.StructType)
	at org.apache.sedona.sql.utils.Adapter$.parseString(Adapter.scala:287)
	at org.apache.sedona.sql.utils.Adapter$.$anonfun$castRowToSchema$1(Adapter.scala:271)
	at scala.collection.immutable.List.map(List.scala:297)
	at org.apache.sedona.sql.utils.Adapter$.castRowToSchema(Adapter.scala:268)
	at org.apache.sedona.sql.utils.Adapter$.$anonfun$toDf$7(Adapter.scala:189)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

The test fails only when the single row processed by Adapter.toDf contains a null struct value. It is flaky because not every row has a null struct value and the order of rows in the data frame is non-deterministic, so the row the test happens to inspect varies between runs. We can fix the null struct value handling and make the test check all rows instead of only the first one.
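To make the failure mode concrete, here is a minimal, self-contained sketch in plain Scala. It is not the Sedona code: the `Kind` types, the `"null"` placeholder string, the comma-separated struct encoding, and the `parseBuggy`/`parseFixed` names are all assumptions introduced for illustration. It shows how a non-exhaustive match in a null-handling branch raises `scala.MatchError` only for rows with a null struct, and why checking only the first row makes the outcome depend on row order.

```scala
import scala.util.Try

object NullStructSketch {
  // Toy stand-ins for schema types; hypothetical, not Spark's DataType hierarchy.
  sealed trait Kind
  case object IntKind extends Kind
  case object TextKind extends Kind
  final case class StructKind(fields: List[Kind]) extends Kind

  // Assumption for this sketch: a missing value arrives as the string "null".
  val NullMarker = "null"

  // Buggy variant: the null branch has no case for StructKind,
  // so a null struct value raises scala.MatchError.
  def parseBuggy(raw: String, kind: Kind): Any =
    if (raw == NullMarker) {
      (kind: @unchecked) match {
        case IntKind  => null
        case TextKind => null
      }
    } else parseValue(raw, kind)

  // Fixed variant: a null value maps to null whatever its kind.
  def parseFixed(raw: String, kind: Kind): Any =
    if (raw == NullMarker) null else parseValue(raw, kind)

  // Parses a non-null value; struct fields are comma-separated in this toy encoding.
  private def parseValue(raw: String, kind: Kind): Any = kind match {
    case IntKind  => raw.toInt
    case TextKind => raw
    case StructKind(fields) =>
      raw.split(",", -1).toList.zip(fields).map { case (part, k) => parseFixed(part, k) }
  }

  val schema: List[Kind] = List(TextKind, StructKind(List(TextKind, IntKind)))

  val rows: List[List[String]] = List(
    List("a", "x,1"),     // struct value present
    List("b", NullMarker) // null struct value
  )

  // Casts one row of raw strings to the schema using the given parser.
  def castRow(parse: (String, Kind) => Any)(row: List[String]): List[Any] =
    row.zip(schema).map { case (raw, kind) => parse(raw, kind) }

  def main(args: Array[String]): Unit = {
    // Inspecting only the first row hides the bug when that row has a non-null struct.
    assert(Try(castRow(parseBuggy)(rows.head)).isSuccess)
    // Inspecting every row exposes the bug regardless of row order.
    assert(rows.exists(row => Try(castRow(parseBuggy)(row)).isFailure))
    // With the fixed parser, all rows convert and the null struct stays null.
    assert(rows.map(castRow(parseFixed)) == List(List("a", List("x", 1)), List("b", null)))
  }
}
```

In this sketch the fix is simply to return null before dispatching on the target type. Iterating over all rows in the assertions is what makes the check deterministic, since it no longer matters which row comes first.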

Issue Links

links to

GitHub Pull Request #733

People

Assignee: Unassigned

Reporter: Kristin Cowalcijk

Votes: 0

Watchers: 1

Dates

Created: 14/Dec/22 08:42

Updated: 13/Mar/23 21:39

Resolved: 04/Jan/23 22:33

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 10m