Flink / FLINK-12183

Job Cluster doesn't stop after cancelling a running job in per-job Yarn mode

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.4, 1.7.2, 1.8.0
    • Fix Version/s: None
    • Component/s: Runtime / REST

    Description

      The per-job Yarn cluster doesn't stop after a running job is cancelled if the job has restarted many times (for example, 1000 times) in a short period.

      The bug occurs in the archiveExecutionGraph() phase, which runs before removeJobAndRegisterTerminationFuture(). The CompletableFuture thread exits unexpectedly with a NullPointerException during archiveExecutionGraph(), so the later step never runs. This is hard to spot because only IOException is caught there. In SubtaskExecutionAttemptDetailsHandler and SubtaskExecutionAttemptAccumulatorsHandler, the archiveJsonWithPath() method builds JSON for the prior execution attempts, but its for loop starts at index 0, which may refer to an attempt that has already been dropped. By default, trying to get such a prior execution attempt returns null (AccessExecution attempt = subtask.getPriorExecutionAttempt).
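
      The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not Flink code: BoundedAttemptHistory, archiveUnsafe, archiveSafe, historyAfterRestarts and cleanupRuns are hypothetical stand-ins, and the capacity of 3 is an arbitrary assumption. It shows a bounded history that returns null for dropped indices, a loop from index 0 that hits a NullPointerException, and a CompletableFuture chain whose follow-up stage is silently skipped as a result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicBoolean;

public class PriorAttemptArchiveSketch {

    /** Hypothetical stand-in for a bounded attempt history; not a Flink class. */
    public static final class BoundedAttemptHistory {
        private final int capacity;
        private final List<String> retained = new ArrayList<>();
        private int total; // number of attempts ever recorded

        public BoundedAttemptHistory(int capacity) {
            this.capacity = capacity;
        }

        public void add(String attempt) {
            if (retained.size() == capacity) {
                retained.remove(0); // drop the oldest attempt
            }
            retained.add(attempt);
            total++;
        }

        public int size() {
            return total;
        }

        /** Returns null when the attempt at this index was dropped. */
        public String get(int index) {
            int firstRetained = total - retained.size();
            if (index < firstRetained || index >= total) {
                return null;
            }
            return retained.get(index - firstRetained);
        }
    }

    /** Loops from index 0 without a null check: throws NPE on a dropped attempt. */
    public static List<String> archiveUnsafe(BoundedAttemptHistory history) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < history.size(); i++) {
            out.add(history.get(i).toUpperCase());
        }
        return out;
    }

    /** Same loop, but skips attempts that are no longer retained. */
    public static List<String> archiveSafe(BoundedAttemptHistory history) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < history.size(); i++) {
            String attempt = history.get(i);
            if (attempt != null) {
                out.add(attempt.toUpperCase());
            }
        }
        return out;
    }

    public static BoundedAttemptHistory historyAfterRestarts(int capacity, int restarts) {
        BoundedAttemptHistory history = new BoundedAttemptHistory(capacity);
        for (int i = 0; i < restarts; i++) {
            history.add("attempt-" + i);
        }
        return history;
    }

    /** Runs an archive stage followed by a cleanup stage; reports whether cleanup ran. */
    public static boolean cleanupRuns(boolean safe) {
        BoundedAttemptHistory history = historyAfterRestarts(3, 1000);
        AtomicBoolean cleanedUp = new AtomicBoolean(false);
        CompletableFuture<Void> done = CompletableFuture
                .supplyAsync(() -> safe ? archiveSafe(history) : archiveUnsafe(history))
                .thenAccept(archived -> cleanedUp.set(true)); // skipped if archiving failed
        try {
            done.join();
        } catch (CompletionException e) {
            // The NPE is only visible to code that inspects the future.
        }
        return cleanedUp.get();
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran (unsafe loop): " + cleanupRuns(false));
        System.out.println("cleanup ran (null-checked loop): " + cleanupRuns(true));
    }
}
```

      In this sketch the cleanup stage plays the role of the step that follows archiving. A null check inside the loop is one possible guard; starting the loop at the first retained index, or handling the exceptional completion of the future, would be alternatives.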

            People

              Assignee: Unassigned
              Reporter: Yumeng Wang
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m