[SPARK-10381] Infinite loop when OutputCommitCoordination is enabled and OutputCommitter.commitTask throws exception - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.3.1, 1.4.1, 1.5.0
Fix Version/s: 1.3.2, 1.4.2, 1.5.1, 1.6.0
Component/s: Scheduler, Spark Core
Labels: None
Target Version/s: 1.3.2, 1.4.2, 1.5.1

Description

When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive commit lock once the task fails. However, due to an identifier mismatch (an attempt number is used in one place and a task attempt ID in another), the lock is never released, causing Spark to go into an infinite retry loop.
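The failure mode can be illustrated with a toy model. The sketch below is not Spark's code: `ToyCommitCoordinator`, `can_commit`, `task_failed`, and `simulate` are hypothetical names, and the choice of which side uses the attempt number and which uses the task attempt ID is an assumption made purely for illustration. It only shows how comparing two different kinds of integer identifier can leave an exclusive lock held forever.

```python
from typing import Dict, Optional


class ToyCommitCoordinator:
    """Toy model of an exclusive per-partition commit lock (not Spark's implementation)."""

    def __init__(self) -> None:
        # partition -> identifier of the attempt currently authorized to commit
        self._authorized: Dict[int, int] = {}

    def can_commit(self, partition: int, attempt: int) -> bool:
        # The first attempt to ask gets the lock; others are denied while it is held.
        holder = self._authorized.setdefault(partition, attempt)
        return holder == attempt

    def task_failed(self, partition: int, attempt: int) -> None:
        # Release the lock only if the failed attempt is the recorded holder.
        if self._authorized.get(partition) == attempt:
            del self._authorized[partition]

    def holder(self, partition: int) -> Optional[int]:
        return self._authorized.get(partition)


def simulate(report_with_attempt_number: bool, max_retries: int = 5) -> bool:
    """Return True if some retry is eventually authorized to commit."""
    coordinator = ToyCommitCoordinator()
    partition = 0

    # Assumed values: attempt number 0 of this task has the unrelated,
    # globally unique task attempt ID 1234.
    attempt_number, task_attempt_id = 0, 1234

    # The first attempt acquires the lock under its attempt number,
    # then fails inside commitTask().
    assert coordinator.can_commit(partition, attempt_number)
    reported = attempt_number if report_with_attempt_number else task_attempt_id
    coordinator.task_failed(partition, reported)

    # Retries use fresh attempt numbers; bounded here so the sketch terminates.
    for retry in range(1, max_retries + 1):
        if coordinator.can_commit(partition, retry):
            return True
    return False
```

With `simulate(True)` the failure is reported under the same kind of identifier that acquired the lock, so the lock is released and the first retry can commit. With `simulate(False)` the failure is reported under the task attempt ID, the comparison never matches, and every retry is denied. In a real scheduler that keeps retrying without a bound, this would look like the infinite loop described above.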

This bug was masked by the OutputCommitCoordinator's lack of end-to-end tests (the current tests rely heavily on mocks). Another contributing factor is that several similarly named identifiers have different semantics but the same data type (e.g. attemptNumber and taskAttemptId), and inconsistent variable naming makes them difficult to distinguish.
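One way to make this class of mix-up harder to write is to give each identifier its own wrapper type instead of passing bare integers. The following is a minimal sketch under that assumption, not the approach taken in the linked pull requests; `AttemptNumber`, `TaskAttemptId`, and `should_release_lock` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttemptNumber:
    """Per-task retry counter (0, 1, 2, ...)."""
    value: int


@dataclass(frozen=True)
class TaskAttemptId:
    """Identifier that is unique across all task attempts."""
    value: int


def should_release_lock(holder: AttemptNumber, failed: AttemptNumber) -> bool:
    """Return True if the failed attempt is the one holding the commit lock."""
    # Reject the wrong kind of identifier loudly instead of silently
    # comparing two unrelated integers.
    if not isinstance(failed, AttemptNumber):
        raise TypeError(f"expected AttemptNumber, got {type(failed).__name__}")
    return holder == failed
```

Calling `should_release_lock(AttemptNumber(0), TaskAttemptId(0))` raises a `TypeError` even though both wrap the same integer, so a test that exercises the failure path without mocks would surface the confusion immediately.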

Issue Links

links to

[Github] Pull Request #8544 (JoshRosen)

[Github] Pull Request #8789 (JoshRosen)

[Github] Pull Request #8790 (JoshRosen)

People

Assignee: Josh Rosen
Reporter: Josh Rosen
Votes: 0
Watchers: 2

Dates

Created: 31/Aug/15 23:41
Updated: 17/May/20 17:48
Resolved: 16/Sep/15 00:42