Flink / FLINK-5830

OutOfMemoryError during notify final state in TaskExecutor may cause job stuck

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: None
    • Labels: None

    Description

      The scenario is as follows:

      While processing a failed execution, the JobMaster tries to cancel all executions, and the TaskExecutor has already acknowledged the cancel RPC message.
      When the TaskExecutor then notifies the JobMaster of the final state, an OutOfMemoryError occurs in AkkaRpcActor. The error is caught and only logged, so the final state is never sent.
      As a result, the JobMaster never receives the final state and cannot trigger the restart strategy.

      One solution is to catch the OutOfMemoryError and rethrow it. This shuts down the ActorSystem, which in turn causes the TaskExecutor to exit. The JobMaster is then notified of the TaskExecutor failure and fails all of its tasks, which triggers the restart successfully.
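
The difference between swallowing and rethrowing the error can be sketched in plain Java. This is only an illustration under stated assumptions: the class `FatalErrorSketch` and the methods `handleSwallowing` and `handleRethrowing` are hypothetical names, and the code does not reproduce Flink's actual AkkaRpcActor implementation or the fix that was merged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the two error-handling patterns; not Flink code. */
public class FatalErrorSketch {

    /** Stand-in for a logger: records what was logged. */
    public static final List<String> LOG = new ArrayList<>();

    /** Problematic pattern: every Throwable, including OutOfMemoryError, is only logged. */
    public static void handleSwallowing(Runnable rpcCall) {
        try {
            rpcCall.run();
        } catch (Throwable t) {
            // The caller never learns that the call failed.
            LOG.add("Error while executing RPC call: " + t);
        }
    }

    /** Proposed pattern: log the OutOfMemoryError, then rethrow it so the process can fail fast. */
    public static void handleRethrowing(Runnable rpcCall) {
        try {
            rpcCall.run();
        } catch (OutOfMemoryError oom) {
            LOG.add("Fatal error while executing RPC call: " + oom);
            throw oom; // propagate so the surrounding runtime can shut down
        } catch (Throwable t) {
            LOG.add("Error while executing RPC call: " + t);
        }
    }

    public static void main(String[] args) {
        // Simulates the final-state notification failing with an OutOfMemoryError.
        Runnable notifyFinalState = () -> {
            throw new OutOfMemoryError("simulated");
        };

        handleSwallowing(notifyFinalState);
        System.out.println("Swallowed; log entries so far: " + LOG.size());

        try {
            handleRethrowing(notifyFinalState);
        } catch (OutOfMemoryError oom) {
            System.out.println("Propagated to caller: " + oom.getMessage());
        }
    }
}
```

In the first variant the caller has no way to detect the failure, which corresponds to the stuck job described above. In the second variant the error reaches the caller, which in the real system would be the layer responsible for shutting down the ActorSystem.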

          People

            Assignee: Zhijiang (zjwang)
            Reporter: Zhijiang (zjwang)
            Votes: 0
            Watchers: 3
