Details
- Bug
- Status: Closed
- Minor
- Resolution: Invalid
- 1.4.0, 1.4.1, 1.5.0, 1.5.1
- None
- None
- Has arisen on a variety of OSes and platforms. It is highly intermittent but annoying; we've seen it throughout the 1.4.x and 1.5.x releases. My environment of current interest happens to be zLinux, which potentially exhibits a higher degree of concurrency than many others. I'm using IBM Java 1.8.0, but this problem has been experienced in other environments, with other vendors' Java, e.g. see External URL
Description
This issue surfaced from the "misbehaved resultHandler should not crash DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite. I've been trying to determine the cause of this problem when it arises (infrequent as it is), by surfacing some of the state transitions in the JobWaiter code responsible for throwing the j.l.UnsupportedOperationException.
Of relevance, the UnsupportedOperationException is thrown on the first call to taskSucceeded() after object instantiation: the executing thread throws because it finds _jobFinished to be 'true', before any of the tasks being waited upon have reported success or failure. That is, _jobFinished (a volatile variable) is perceived to be set true during object initialisation, as if its value were based on the boolean expression 'totalTasks==0' (totalTasks is one of the formal arguments of the class constructor). In fact, the correct initial state for these variables in the relevant DAGSchedulerSuite test is totalTasks==2, and hence _jobFinished=false. We are apparently seeing a race condition between the read and write operations of different threads; is the volatile annotation on _jobFinished the only thing providing any thread safety?
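To make the hypothesis concrete, here is a minimal sketch of the state described above. It is a simplified model, not the Spark source: the class name JobWaiterSketch, the helper JobWaiterSketchDemo, and the single-argument taskSucceeded are assumptions made for illustration. Only _jobFinished, totalTasks, finishedTasks, and the exception message come from the report.

```scala
// Simplified model of the JobWaiter state described in this report.
// NOT the Spark implementation; names other than _jobFinished, totalTasks
// and finishedTasks are hypothetical.
final class JobWaiterSketch(totalTasks: Int) {
  private var finishedTasks = 0

  // The initial value is derived from the constructor argument.
  @volatile private var _jobFinished: Boolean = totalTasks == 0

  def jobFinished: Boolean = _jobFinished

  def taskSucceeded(index: Int): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException(
        "taskSucceeded() called on a finished JobWaiter")
    }
    finishedTasks += 1
    if (finishedTasks == totalTasks) _jobFinished = true
  }
}

object JobWaiterSketchDemo {
  // True when the very first taskSucceeded() call on a fresh waiter throws.
  def firstCallThrows(totalTasks: Int): Boolean = {
    val waiter = new JobWaiterSketch(totalTasks)
    try { waiter.taskSucceeded(0); false }
    catch { case _: UnsupportedOperationException => true }
  }
}
```

In this model, the first call can only throw when the flag was initialised from totalTasks == 0. That is why seeing the exception with totalTasks == 2 is surprising.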
The DAGSchedulerSuite test then fails because ScalaTest asserts that it receives a deliberately thrown exception, DAGSchedulerSuiteDummyException, from the resultHandler function (perhaps as a check on the setup of the test?). Instead, in our problem scenario, it first captures the RuntimeException, namely the UnsupportedOperationException, produced by the (incompletely initialised?) JobWaiter code.
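The kind of type check that fails here can be illustrated with a small sketch. It is plain Scala rather than ScalaTest. DummyException and InstanceCheckSketch are hypothetical stand-ins, not names from the Spark test suite.

```scala
// Hypothetical stand-in for DAGSchedulerSuiteDummyException.
final class DummyException extends Exception

object InstanceCheckSketch {
  // Returns a failure description when `captured` is not of the expected type,
  // or None when the type matches.
  def mismatch(captured: Throwable, expected: Class[_ <: Throwable]): Option[String] =
    if (expected.isInstance(captured)) None
    else Some(s"$captured was not instance of ${expected.getName}")
}
```

Passing an UnsupportedOperationException where a DummyException is expected yields a description rather than None. This mirrors the situation in which the test captures the JobWaiter exception before the deliberately thrown one.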
The test name suggests that the objective is that the DAGScheduler and SparkContext are 'not crashed'. The test proceeds to run a count operation on the SparkContext, which succeeds; that is, neither is apparently crashed, which should be a positive outcome?
It would be, except for this occasional RuntimeException clouding the issue.
(Is this deliberate, or is it a deficiency of the current testcase?)
- misbehaved resultHandler should not crash DAGScheduler and SparkContext *** FAILED ***
java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter was not instance of org.apache.spark.scheduler.DAGSchedulerSuiteDummyException (DAGSchedulerSuite.scala:869)
Failed: failing job... exception: org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
Succeeded: 0 (0 of 2)
Succeeded: 1 (1 of 2)
(My additional diagnostics presented here are minimal: I've surfaced the exception passed to the jobFailed() routine, and the "Succeeded" message from taskSucceeded() prints the index, followed by finishedTasks and totalTasks as "(.. of ..)".)
I thought I was close (I still might be) to proposing a fix for this issue, although its intermittency is hampering my efforts. Nevertheless, I wanted to submit my hypothesis for feedback.
Issue Links
- is superseded by SPARK-11066 Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter (Resolved)