FLINK-12219

Yarn application can't stop when flink job failed in per-job yarn cluster mode


    Details

      Description

      Issue detail info

      In our Flink (1.6.3) production environment, I often see the YARN application fail to stop after a Flink job fails in per-job YARN cluster mode, so I analyzed the cause in detail.

      When a Flink job fails, the system writes an archive file to a FileSystem through the MiniDispatcher#archiveExecutionGraph method and then notifies YarnJobClusterEntrypoint to shut down. But if MiniDispatcher#archiveExecutionGraph throws an exception during execution, the calls that follow it, including the shutdown notification, are affected.

      I opened FLINK-12247 to fix the NPE that occurs when the system writes the archive to the FileSystem. But we still need to consider other exceptions, so we should catch Exception / Throwable, not just IOException.
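To illustrate why the width of the catch clause matters here, below is a minimal, self-contained sketch. It is not Flink's code: `ArchiveShutdownSketch`, `Archiver`, and both `finish...` methods are hypothetical stand-ins for the archive step and the shutdown notification described above. It only assumes that the archive step can throw something other than an IOException (an NPE in this simulation).

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; these names are stand-ins, not Flink classes. */
public class ArchiveShutdownSketch {

    /** Stand-in for the archive step, which may fail in arbitrary ways. */
    interface Archiver {
        void archive() throws IOException;
    }

    /** Only IOException is handled, so an unchecked exception escapes and the shutdown step is skipped. */
    static void finishCatchingIOExceptionOnly(Archiver archiver, Runnable shutDown) {
        try {
            archiver.archive();
        } catch (IOException e) {
            System.err.println("Could not archive: " + e);
        }
        shutDown.run();
    }

    /** Any failure of the archive step is logged, and the shutdown step still runs. */
    static void finishCatchingThrowable(Archiver archiver, Runnable shutDown) {
        try {
            archiver.archive();
        } catch (Throwable t) {
            System.err.println("Could not archive: " + t);
        }
        shutDown.run();
    }

    public static void main(String[] args) {
        // Simulate the archive step failing with an unchecked exception.
        Archiver failing = () -> { throw new NullPointerException("simulated"); };

        AtomicBoolean narrowReached = new AtomicBoolean(false);
        try {
            finishCatchingIOExceptionOnly(failing, () -> narrowReached.set(true));
        } catch (NullPointerException e) {
            // The exception escaped the method before the shutdown step ran.
        }

        AtomicBoolean broadReached = new AtomicBoolean(false);
        finishCatchingThrowable(failing, () -> broadReached.set(true));

        System.out.println("catch IOException: shutdown reached = " + narrowReached.get());
        System.out.println("catch Throwable:   shutdown reached = " + broadReached.get());
    }
}
```

Catching Throwable also swallows Errors, so a real fix might prefer catching Exception or running the shutdown step in a finally block. Which of these fits best depends on how the entrypoint should react to fatal errors.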

      Flink yarn job fail flow

      Flink yarn job fail on yarn

      Flink yarn application can't stop


            People

              Assignee: lamber-ken
              Reporter: lamber-ken


                Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 1h
