[FLINK-12183] Job Cluster doesn't stop after cancel a running job in per-job Yarn mode - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.6.4, 1.7.2, 1.8.0
Fix Version/s: None
Component/s: Runtime / REST
Labels:
- pull-request-available

Description

The per-job Yarn cluster does not stop after a running job is cancelled if the job has restarted many times (for example, 1000 times) in a short period.

The bug occurs in the archiveExecutionGraph() phase, which runs before removeJobAndRegisterTerminationFuture(). The CompletableFuture thread exits unexpectedly with a NullPointerException during archiveExecutionGraph(), so the termination step is never reached. The failure is hard to spot because the code there only catches IOException. In SubtaskExecutionAttemptDetailsHandler and SubtaskExecutionAttemptAccumulatorsHandler, archiveJsonWithPath() builds JSON for the prior execution attempts in a for loop whose index starts at 0, but the attempts at the lowest indices may already have been dropped. By default, fetching a dropped prior execution attempt returns null (AccessExecution attempt = subtask.getPriorExecutionAttempt).
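The failure mode described above can be sketched in isolation. The following is a minimal illustration, not Flink code: `PriorAttemptSketch`, `BoundedHistory`, `archiveUnguarded`, `archiveGuarded`, `historyWith`, and `shutdownRuns` are hypothetical names, the capacity of 16 is an arbitrary assumption, and plain strings stand in for execution attempts. It shows a loop starting at index 0 hitting a dropped (null) entry, the resulting NullPointerException preventing the dependent stage of a CompletableFuture from running, and a null check that avoids it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the archiving path; none of these names come from Flink. */
class PriorAttemptSketch {

    /** Keeps only the newest `capacity` attempts; older ones are dropped. */
    static final class BoundedHistory {
        private final String[] slots;
        private int total; // number of attempts ever recorded

        BoundedHistory(int capacity) {
            slots = new String[capacity];
        }

        void add(String attempt) {
            slots[total % slots.length] = attempt; // overwrite the oldest slot
            total++;
        }

        /** Returns null when the requested attempt has already been dropped. */
        String get(int attemptNumber) {
            if (attemptNumber < 0 || attemptNumber >= total) {
                throw new IndexOutOfBoundsException("attempt " + attemptNumber);
            }
            if (attemptNumber < total - slots.length) {
                return null; // dropped entry
            }
            return slots[attemptNumber % slots.length];
        }

        int size() {
            return total;
        }
    }

    /** Starts at 0 and dereferences every entry: throws NPE on a dropped one. */
    static List<String> archiveUnguarded(BoundedHistory history) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < history.size(); i++) {
            out.add(history.get(i).toUpperCase());
        }
        return out;
    }

    /** Same loop, but skips entries that are no longer available. */
    static List<String> archiveGuarded(BoundedHistory history) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < history.size(); i++) {
            String attempt = history.get(i);
            if (attempt != null) {
                out.add(attempt.toUpperCase());
            }
        }
        return out;
    }

    /** Builds a history after the given number of restarts. */
    static BoundedHistory historyWith(int capacity, int restarts) {
        BoundedHistory history = new BoundedHistory(capacity);
        for (int i = 0; i < restarts; i++) {
            history.add("attempt-" + i);
        }
        return history;
    }

    /** Reports whether the stage chained after the unguarded archive step ran. */
    static boolean shutdownRuns(BoundedHistory history) {
        AtomicBoolean ran = new AtomicBoolean(false);
        CompletableFuture<Void> chain = CompletableFuture
                .supplyAsync(() -> archiveUnguarded(history))
                .thenRun(() -> ran.set(true)); // skipped if archiving throws
        chain.exceptionally(t -> null).join(); // wait without rethrowing
        return ran.get();
    }

    public static void main(String[] args) {
        BoundedHistory history = historyWith(16, 1000);
        System.out.println("guarded entries: " + archiveGuarded(history).size());
        System.out.println("shutdown ran: " + shutdownRuns(history));
    }
}
```

In this sketch the exception is not an IOException, so a catch block limited to IOException would not intercept it; the chained stage is simply never executed, which mirrors the symptom of the cluster staying up.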

Attachments

Issue Links

duplicates

FLINK-12247 fix NPE when writing an archive file to a FileSystem

Resolved

links to

GitHub Pull Request #8163

People

Assignee: Unassigned

Reporter: Yumeng Wang

Votes: 0

Watchers: 4

Dates

Created: 13/Apr/19 10:05

Updated: 28/Apr/19 16:47

Resolved: 26/Apr/19 09:54

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m