FLINK-12219

YARN application can't stop when a Flink job fails in per-job YARN cluster mode


Details

    Description

      Issue detail info

      In our Flink 1.6.3 production environment, I often see a YARN application fail to stop after the Flink job has failed in per-job YARN cluster mode, so I analyzed why this happens.

      When a Flink job fails, the system writes an archive file to a FileSystem through the MiniDispatcher#archiveExecutionGraph method and then notifies YarnJobClusterEntrypoint to shut down. However, if MiniDispatcher#archiveExecutionGraph throws an exception during execution, the subsequent calls are affected, so the shutdown notification may never be sent.

      So I opened FLINK-12247 to fix the NPE that occurs when the system writes the archive to the FileSystem. But we still need to consider other exceptions, so we should catch Exception / Throwable, not just IOException.
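
      To illustrate the point about the catch clause, here is a minimal sketch in plain Java. It is not Flink's actual code: the class ArchiveShutdownSketch, the Archivist interface, and the methods finishNarrow / finishBroad are hypothetical names that only model the "archive first, then signal shutdown" sequence described above.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the archive-then-shutdown sequence; not Flink's actual code.
final class ArchiveShutdownSketch {

    // Hypothetical archiver: declares IOException but may also throw unchecked exceptions.
    interface Archivist {
        void archive(String jobId) throws IOException;
    }

    // Narrow handling: only IOException is caught, so an unchecked exception
    // (for example an NPE) leaves this method before the shutdown signal is sent.
    static void finishNarrow(String jobId, Archivist archivist, CompletableFuture<String> shutdown) {
        try {
            archivist.archive(jobId);
        } catch (IOException e) {
            System.err.println("Could not archive " + jobId + ": " + e);
        }
        shutdown.complete("FAILED");
    }

    // Broad handling: any failure while archiving is logged, and the shutdown
    // signal is still sent afterwards.
    static void finishBroad(String jobId, Archivist archivist, CompletableFuture<String> shutdown) {
        try {
            archivist.archive(jobId);
        } catch (Throwable t) {
            System.err.println("Could not archive " + jobId + ": " + t);
        }
        shutdown.complete("FAILED");
    }

    public static void main(String[] args) {
        // Simulated archiver failure that is not an IOException.
        Archivist npeArchivist = jobId -> {
            throw new NullPointerException("simulated NPE in archiver");
        };

        CompletableFuture<String> narrow = new CompletableFuture<>();
        try {
            finishNarrow("job-1", npeArchivist, narrow);
        } catch (RuntimeException escaped) {
            // Stands in for a caller that never reacts to the escaped exception.
        }
        System.out.println("narrow catch, shutdown signalled: " + narrow.isDone());

        CompletableFuture<String> broad = new CompletableFuture<>();
        finishBroad("job-1", npeArchivist, broad);
        System.out.println("broad catch, shutdown signalled: " + broad.isDone());
    }
}
```

      In this sketch the narrow variant leaves the shutdown future incomplete after the simulated NPE, while the broad variant completes it. Catching Throwable also covers Errors, so whether Exception or Throwable is the better choice is a judgment call for the actual fix.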

      Flink yarn job fail flow

      Flink yarn job fail on yarn

       

      Flink yarn application can't stop

       

       

      Attachments

        1. image-2019-04-23-17-37-00-081.png
          46 kB
          lamber-ken
        2. image-2019-04-17-15-02-49-513.png
          31 kB
          lamber-ken
        3. image-2019-04-17-15-00-40-687.png
          62 kB
          lamber-ken
        4. fix-bug.patch
          0.9 kB
          lamber-ken

          People

            lamber-ken


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h