Spark / SPARK-10976

java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 1.4.0, 1.4.1, 1.5.0, 1.5.1
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      This issue surfaced from the "misbehaved resultHandler should not crash DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite. I have been trying to determine the cause of this problem when it arises (as infrequently as it does), and to surface the state transitions in the JobWaiter code responsible for throwing the java.lang.UnsupportedOperationException.

      Of relevance, the UnsupportedOperationException is thrown on the first occasion taskSucceeded() is called (after object instantiation): the executing thread throws the exception because it finds _jobFinished to be true before any of the tasks being waited upon have reported their success or failure. That is, _jobFinished (a volatile variable) is apparently being set true during object initialisation, as if its value were based on the boolean expression totalTasks == 0 (totalTasks being one of the formal arguments of the class constructor). In fact, the correct initial state for the relevant DAGSchedulerSuite test is totalTasks == 2, and hence _jobFinished == false. We are apparently seeing a race condition amongst the read and write operations of the threads involved; only the volatile annotation on _jobFinished is providing any thread safety?
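      The state transitions described above can be sketched as follows. This is a minimal illustration with assumed names and details, not the actual Spark source; it only shows why _jobFinished starting out as (totalTasks == 0) would make the very first taskSucceeded() call throw if totalTasks were ever observed as 0 when 2 was intended:

```scala
// Minimal sketch (assumed names, not the actual Spark source) of the
// JobWaiter state transitions described in this report.
class SimpleJobWaiter(totalTasks: Int) {
  // Initialised from the constructor argument: a zero-task job is
  // finished immediately. If totalTasks were somehow observed as 0 when
  // 2 was intended, _jobFinished would start out true and the very first
  // taskSucceeded() call would throw.
  @volatile private var _jobFinished = totalTasks == 0
  private var finishedTasks = 0

  def jobFinished: Boolean = _jobFinished

  def taskSucceeded(index: Int): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException(
        "taskSucceeded() called on a finished JobWaiter")
    }
    finishedTasks += 1
    if (finishedTasks == totalTasks) _jobFinished = true
  }
}

// Expected path for the test: totalTasks == 2, so the waiter starts
// unfinished and both task-success reports are accepted.
val waiter = new SimpleJobWaiter(2)
waiter.taskSucceeded(0)
waiter.taskSucceeded(1)
println(waiter.jobFinished)
```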

      The DAGSchedulerSuite test then fails because the ScalaTest assertion expects to receive a deliberately thrown exception, DAGSchedulerSuiteDummyException, from the ResultHandler function (albeit as a check on the setup of the test?). In our problem scenario it instead first captures the RuntimeException, the UnsupportedOperationException, produced by the (incompletely initialised?) JobWaiter code.
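      The assertion pattern at issue can be paraphrased as below. This is a hypothetical sketch, not the actual suite code: the test expects the captured failure cause to be the deliberately thrown dummy exception, so an UnsupportedOperationException arriving first produces a "... was not instance of ..." failure like the one in the log further down:

```scala
// Hypothetical sketch of the assertion pattern (stand-in names; the real
// test uses DAGSchedulerSuiteDummyException inside DAGSchedulerSuite).
class DummyException extends Exception

def describeCause(cause: Throwable): String =
  if (cause.isInstanceOf[DummyException]) "setup check passed"
  else s"${cause.getMessage} was not instance of DummyException"

println(describeCause(new DummyException))
println(describeCause(new UnsupportedOperationException(
  "taskSucceeded() called on a finished JobWaiter")))
```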

      The test suggests that the objective is that the DAGScheduler and SparkContext are 'not crashed': it proceeds to run a count operation on the SparkContext, and both succeed; that is, neither is apparently crashed, which should be a positive outcome.
      It would be, except for this occasional RuntimeException clouding the issue.
      (Is this deliberate, or is it a deficiency of the current test case?)

      • misbehaved resultHandler should not crash DAGScheduler and SparkContext *** FAILED ***
        java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter was not instance of org.apache.spark.scheduler.DAGSchedulerSuiteDummyException (DAGSchedulerSuite.scala:869)
        Failed: failing job... exception: org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
        Succeeded: 0 (0 of 2)
        Succeeded: 1 (1 of 2)

      (My additional diagnostics presented here are minimal: I have surfaced the exception passed into the jobFailed() routine, and the index, finishedTasks, and totalTasks as the "Succeeded: ... (... of ...)" message from taskSucceeded().)
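      Those diagnostics can be reconstructed as a small sketch. Only the message formats come from the log above; the helper names are my own:

```scala
// Hypothetical reconstruction of the diagnostic messages. Only the
// formats are taken from the reported log lines; names are assumptions.
def succeededMessage(index: Int, finishedTasks: Int, totalTasks: Int): String =
  s"Succeeded: $index ($finishedTasks of $totalTasks)"

def failedMessage(exception: Throwable): String =
  s"Failed: failing job... exception: $exception"

// Reproduces the two "Succeeded" lines from the log: the count is printed
// before it is incremented, hence "0 (0 of 2)" then "1 (1 of 2)".
var finished = 0
for (index <- 0 until 2) {
  println(succeededMessage(index, finished, totalTasks = 2))
  finished += 1
}
```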

      I thought I was close (I still might be) to proposing a fix, although the intermittency of this issue is hampering my efforts. Nevertheless, I wanted to submit my hypothesis for any feedback.

      Attachments

      Issue Links

      Activity

      People

      Assignee: Unassigned
      Reporter: shellberg (Dr Stephen A Hellberg)
      Votes: 0
      Watchers: 1
