FLINK-12219

YARN application can't stop when a Flink job fails in per-job YARN cluster mode

    Details

      Description

      Issue detail info

      In our Flink (1.6.3) production environment, I often see the YARN application fail to stop after a Flink job fails in per-job YARN cluster mode, so I analyzed why this happens.

      When a Flink job fails, the system writes an archive file to a FileSystem through the MiniDispatcher#archiveExecutionGraph method and then notifies YarnJobClusterEntrypoint to shut down. However, if MiniDispatcher#archiveExecutionGraph throws an exception during execution, the calls that follow it are affected, so the shutdown notification may never be sent and the YARN application keeps running.

      So I opened FLINK-12247 to fix the NPE (NullPointerException) bug that occurs when the system writes the archive to the FileSystem. But we still need to consider other exceptions, so we should catch Exception / Throwable, not just IOException.
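
      The idea can be sketched roughly as follows. This is not the actual Flink code or the attached patch: ArchiveShutdownSketch, Archivist, ApplicationStatus, and onJobFailed are hypothetical names used only for illustration. The sketch assumes the shutdown signal is a future that the entrypoint waits on, and shows that catching Throwable (with the completion in a finally block) lets the shutdown proceed even when archiving fails with something other than IOException.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

public class ArchiveShutdownSketch {

    /** Hypothetical stand-in for the component that writes the archive file. */
    interface Archivist {
        void archive(String jobId) throws IOException;
    }

    /** Hypothetical outcome reported to the cluster entrypoint. */
    enum ApplicationStatus { SUCCEEDED, FAILED }

    /**
     * Called when a job has failed: try to archive, then always signal shutdown.
     * Catching Throwable (not just IOException) keeps an unexpected error,
     * such as a NullPointerException, from skipping the shutdown signal.
     */
    static CompletableFuture<ApplicationStatus> onJobFailed(String jobId, Archivist archivist) {
        CompletableFuture<ApplicationStatus> shutdownFuture = new CompletableFuture<>();
        try {
            archivist.archive(jobId);
        } catch (Throwable t) {
            // Archiving is best effort; log the failure and carry on.
            System.err.println("Could not archive job " + jobId + ": " + t);
        } finally {
            // The shutdown signal is completed on every path.
            shutdownFuture.complete(ApplicationStatus.FAILED);
        }
        return shutdownFuture;
    }

    public static void main(String[] args) {
        // An archivist that fails with an unchecked exception instead of IOException.
        Archivist broken = jobId -> { throw new NullPointerException("archive path is null"); };
        CompletableFuture<ApplicationStatus> f = onJobFailed("job-1", broken);
        System.out.println(f.isDone() + " " + f.getNow(null)); // prints "true FAILED"
    }
}
```

      With a catch block limited to IOException and no finally, the NullPointerException above would propagate out of onJobFailed and the future would never be completed, which mirrors the hang described in this issue.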

      Flink yarn job fail flow

      Flink yarn job fail on yarn

       

      Flink yarn application can't stop

       

       

        Attachments

        1. image-2019-04-17-15-00-40-687.png
          62 kB
          lamber-ken
        2. image-2019-04-17-15-02-49-513.png
          31 kB
          lamber-ken
        3. fix-bug.patch
          0.9 kB
          lamber-ken
        4. image-2019-04-23-17-37-00-081.png
          46 kB
          lamber-ken

            People

            • Assignee: lamber-ken
            • Reporter: lamber-ken
            • Votes: 0
            • Watchers: 2

                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h