Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5830

OutOfMemoryError during notify final state in TaskExecutor may cause job stuck

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: None
    • Labels:
      None

      Description

      The scenario is like this:

      JobMaster tries to cancel all the executions when process failed execution, and the task executor already acknowledge the cancel rpc message.
      When notify the final state in TaskExecutor, it causes OOM in AkkaRpcActor and this error is caught to log the info. The final state will not be sent any more.
      The JobMaster can not receive the final state and trigger the restart strategy.

      One solution is to catch the OutOfMemoryError and throw it, then it will cause to shut down the ActorSystem resulting in exiting the TaskExecutor. The JobMaster can be notified of TaskExecutor failure and fail all the tasks to trigger restart successfully.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                zjwang zhijiang
                Reporter:
                zjwang zhijiang
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: