[SPARK-11066] Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler occasionally fails due to j.l.UnsupportedOperationException concerning a finished JobWaiter - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.4.0, 1.4.1, 1.5.0, 1.5.1
Fix Version/s: 1.5.2, 1.6.0
Component/s: Scheduler, Spark Core, Tests
Labels:
None
Environment:

Multiple OS and platform types.
(Also observed by others, e.g. see External URL)

Flags:

Patch
External issue ID:
http://mail-archives.apache.org/mod_mbox/spark-dev/201508.mbox/%3CCAMAsSdJbkQE+g6HEAM_1J2j2DCPWNXBucUVPRBVZnJj2n1gg6Q@mail.gmail.com%3E

Description

The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent problem. It creates a job for the DAGScheduler comprising multiple (2) tasks. The job will fail and a SparkDriverExecutionException will be returned, but there is a race as to which exception ends up as its cause: either the DAGSchedulerSuiteDummyException that the first task's ResultHandler call deliberately throws as part of the test setup, or the DAGScheduler's legitimate UnsupportedOperationException (a subclass of RuntimeException) that the second (and any subsequent) task receives when it completes against the already failed job. This race condition is likely associated with the vagaries of processing quanta and the expense of throwing two exceptions (under interpreter execution) per thread of control. It is usually 'won' by the first task's DAGSchedulerSuiteDummyException, as desired (and expected)... but not always.

The problem for the test case is that the first assertion largely concerns the test setup, and doesn't (can't? Sorry, still not a ScalaTest expert) capture all the causes of SparkDriverExecutionException that can legitimately arise from a correctly working (not crashed) DAGScheduler. Arguably, this assertion might test something of the DAGScheduler, but not all the possible outcomes for a working DAGScheduler. Consequently, when the job comprises multiple tasks, this test can report a failure when in fact the DAGScheduler is working as designed (and has not crashed). Furthermore, the test has already failed before it actually tries to use the SparkContext a second time (for an arbitrary processing task), which I think is the real subject of the test?

The solution, I submit, is to ensure that the job is composed of just one task, so that this single task results in the call to the compromised ResultHandler, causing the test's deliberate exception to be thrown and exercising the relevant (DAGScheduler) code paths. Given that the number of tasks is determined by the number of partitions of an RDD, this could be achieved with a single-partition RDD (indeed, doing so seems to exercise/would test some default parallelism support of the TaskScheduler?). The pull request offered, however, is based on the minimal change of using just one partition of the 2 (or more) partition parallelized RDD. This results in scheduling a job of just one task: one successful task calls the user-supplied compromised ResultHandler function, which fails the job and unambiguously wraps our DAGSchedulerSuiteDummyException inside a SparkDriverExecutionException. There are no other tasks that, on completing successfully, find the job already failed and cause the 'undesired' UnsupportedOperationException to be thrown instead. This, then, satisfies the test's setup assertion.
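To make the one-task versus two-task difference concrete, here is a minimal, single-threaded sketch in plain Java. It is a model, not Spark's code: WaiterModel, RaceSketch and DummyException are hypothetical names, a plain RuntimeException stands in for SparkDriverExecutionException, and the sketch assumes that a later recorded failure replaces an earlier one. It shows the cause a reader would see if it only looked at the job result after every task completion had been processed.

```java
import java.util.function.IntConsumer;

/** Simplified stand-in for a job waiter; hypothetical, not Spark's JobWaiter. */
final class WaiterModel {
    private final IntConsumer resultHandler;
    private boolean jobFinished = false;
    private Exception failure = null; // most recently recorded failure, if any

    WaiterModel(IntConsumer resultHandler) {
        this.resultHandler = resultHandler;
    }

    /** Called once per successful task; rejects calls after the job has finished. */
    void taskSucceeded(int index) {
        if (jobFinished) {
            throw new UnsupportedOperationException(
                "taskSucceeded() called on a finished JobWaiter");
        }
        resultHandler.accept(index); // user-supplied code, may throw
    }

    /** Assumption of this model: a later failure overwrites an earlier one. */
    void jobFailed(Exception e) {
        jobFinished = true;
        failure = e;
    }

    Throwable currentCause() {
        return failure == null ? null : failure.getCause();
    }
}

final class RaceSketch {
    /** Hypothetical stand-in for the test's deliberately thrown exception. */
    static final class DummyException extends RuntimeException {}

    /** Processes `tasks` completions, then reports the cause seen afterwards. */
    static String causeSeenAfter(int tasks) {
        WaiterModel waiter = new WaiterModel(i -> { throw new DummyException(); });
        for (int i = 0; i < tasks; i++) {
            try {
                waiter.taskSucceeded(i);
            } catch (Exception e) {
                // Stand-in for wrapping in SparkDriverExecutionException.
                waiter.jobFailed(new RuntimeException(e));
            }
        }
        return waiter.currentCause().getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(causeSeenAfter(1)); // prints "DummyException"
        System.out.println(causeSeenAfter(2)); // prints "UnsupportedOperationException"
    }
}
```

In this model a one-task job can only ever report the dummy exception, whereas with two tasks the answer depends on whether the result is read before or after the second completion is processed, which is the timing window described above.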

I have tested this hypothesis by parameterising the number of partitions, N, used by the "misbehaved ResultHandler" job, and have observed 1 x DAGSchedulerSuiteDummyException first, followed by the legitimate N-1 x UnsupportedOperationExceptions. What propagates back from the job seems simply to be the outcome of the race between task threads, which explains the intermittent failures observed.
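The observed pattern of one dummy exception followed by N-1 UnsupportedOperationExceptions can be reproduced with a small counting sketch. This is an illustrative plain-Java model rather than the instrumentation actually used: PartitionCountSketch and DummyException are hypothetical names, and task completions are processed strictly one after another.

```java
import java.util.ArrayList;
import java.util.List;

final class PartitionCountSketch {
    /** Hypothetical stand-in for the test's deliberately thrown exception. */
    static final class DummyException extends RuntimeException {}

    /** Returns, in processing order, the exception type raised for each of n tasks. */
    static List<String> exceptionsFor(int n) {
        List<String> seen = new ArrayList<>();
        boolean jobFailed = false;
        for (int task = 0; task < n; task++) {
            try {
                if (jobFailed) {
                    // The job has already failed, so this completion is rejected.
                    throw new UnsupportedOperationException(
                        "taskSucceeded() called on a finished JobWaiter");
                }
                // The misbehaving result handler always throws.
                throw new DummyException();
            } catch (RuntimeException e) {
                jobFailed = true;
                seen.add(e.getClass().getSimpleName());
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println("N=" + n + ": " + exceptionsFor(n));
        }
    }
}
```

For N = 1 the list holds only the dummy exception, which is the situation the proposed single-partition job creates.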

Issue Links

supersedes

SPARK-10976 java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter

Closed

links to

[Github] Pull Request #9076 (shellberg)

People

Assignee: Dr Stephen A Hellberg

Reporter: Dr Stephen A Hellberg

Votes: 0

Watchers: 2

Dates

Created: 12/Oct/15 15:39

Updated: 15/Oct/15 17:08

Resolved: 15/Oct/15 17:07