Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Abandoned
Affects Version/s: 2.3.4, 3.0.0
Fix Version/s: None
Description
We got an exception when running a Spark application under Spark 2.3.4 and Spark 3.0 with the conf spark.shuffle.useOldFetchProtocol=true. The application failed because a stage hit a fetch failure and the stage was not retried.
The code is like the following:
import org.apache.spark.{SparkConf, SparkContext}

val Array(input) = args
val sparkConf = new SparkConf().setAppName("Spark Fetch Failed Test")
// for running directly in IDE
sparkConf.setIfMissing("spark.master", "local[2]")
val sc = new SparkContext(sparkConf)
val lines = sc.textFile(input)
  .repartition(1)
  .map(data => data.trim)
  .repartition(1)
val doc = lines.map(data => (data, 1)).reduceByKey(_ + _).collect()
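For reference, the conf mentioned above can also be set programmatically on the SparkConf instead of through spark-submit --conf; this is only a sketch (the val name is arbitrary), not necessarily how the failing job was configured:

import org.apache.spark.SparkConf

// Sketch only: enable the old shuffle fetch protocol before the
// SparkContext is created, equivalent to passing
// --conf spark.shuffle.useOldFetchProtocol=true to spark-submit.
val confWithOldProtocol = new SparkConf()
  .setAppName("Spark Fetch Failed Test")
  .set("spark.shuffle.useOldFetchProtocol", "true")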
The application DAG is like the following:
If stage 3 fails due to a fetch failure, the application will not retry stage 2 and stage 3 and will fail the job, because Spark considers stage 2 and stage 3 non-retryable: the RDDs in stage 2 and stage 3 are INDETERMINATE.
Actually, if the shuffle output belonging to stage 1 exists completely, stage 2 and stage 3 are retryable, because the RDDs in them are not order-sensitive. If we allow stage 2 and stage 3 to retry, we have trouble handling DAGScheduler.getMissingParentStages. And I am not sure whether DAGScheduler.getMissingParentStages breaks the rule that INDETERMINATE RDDs are non-retryable.
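Not part of the report, but to illustrate why the repartition stages are treated as INDETERMINATE: repartition() assigns records to partitions round-robin based on the order in which the upstream shuffle delivers them, so a recompute can place records differently. A key-based shuffle depends only on the key, so a determinate sketch of roughly the same job (assuming the same sc and input as above) could look like:

import org.apache.spark.HashPartitioner

// Sketch only: shuffling by key with an explicit HashPartitioner makes each
// record's target partition a function of its key, not of the upstream
// delivery order, so the shuffle output stays DETERMINATE and a retry after
// a fetch failure is safe.
val determinateDoc = sc.textFile(input)
  .map(data => (data.trim, 1))
  .reduceByKey(new HashPartitioner(1), _ + _)
  .collect()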
I would appreciate it if someone would reply.