Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-3347

TaskManager (or its ActorSystem) need to restart in case they notice quarantine

    Details

      Description

      There are cases where Akka quarantines remote actor systems. In that case, no further communication is possible with that actor system unless one of the two actor systems is restarted.

      The result is that a TaskManager is up and available, but cannot register at the JobManager (Akka refuses connection because of the quarantined state), making the TaskManager a useless process.

      I suggest to let the TaskManager restart itself once it notices that either it quarantined the JobManager, or the JobManager quarantined it.

      It is possible to recognize that by listening to certain events in the actor system event stream: http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state

        Issue Links

          Activity

          Hide
          StephanEwen Stephan Ewen added a comment -

          Duplicate of FLINK-3345

          Show
          StephanEwen Stephan Ewen added a comment - Duplicate of FLINK-3345
          Hide
          till.rohrmann Till Rohrmann added a comment -

          Haha, I just closed FLINK-3345 as a duplicate of this one. Let's reopen this issue.

          Show
          till.rohrmann Till Rohrmann added a comment - Haha, I just closed FLINK-3345 as a duplicate of this one. Let's reopen this issue.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2696

          FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

          The QuarantineMonitor subscribes to the actor system's event bus and listens to
          AssociationErrorEvents. These are the events which are generated when the actor system
          has quarantined another actor system or if it has been quarantined by another actor
          system. In case of the quarantined state, the actor system will be shutdown killing
          all actors and then the JVM is terminated.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink quarantineMonitor

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2696.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2696


          commit 99b93e896dfa9d04ffb9cb65286850dfd8835ba5
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-26T22:24:12Z

          FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

          The QuarantineMonitor subscribes to the actor system's event bus and listens to
          AssociationErrorEvents. These are the events which are generated when the actor system
          has quarantined another actor system or if it has been quarantined by another actor
          system. In case of the quarantined state, the actor system will be shutdown killing
          all actors and then the JVM is terminated.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2696 FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down The QuarantineMonitor subscribes to the actor system's event bus and listens to AssociationErrorEvents. These are the events which are generated when the actor system has quarantined another actor system or if it has been quarantined by another actor system. In case of the quarantined state, the actor system will be shutdown killing all actors and then the JVM is terminated. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink quarantineMonitor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2696.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2696 commit 99b93e896dfa9d04ffb9cb65286850dfd8835ba5 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-10-26T22:24:12Z FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down The QuarantineMonitor subscribes to the actor system's event bus and listens to AssociationErrorEvents. These are the events which are generated when the actor system has quarantined another actor system or if it has been quarantined by another actor system. In case of the quarantined state, the actor system will be shutdown killing all actors and then the JVM is terminated.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/2697

          [backport] FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

          This is a back port for release 1.1.

          The QuarantineMonitor subscribes to the actor system's event bus and listens to
          AssociationErrorEvents. These are the events which are generated when the actor system
          has quarantined another actor system or if it has been quarantined by another actor
          system. In case of the quarantined state, the actor system will be shutdown killing
          all actors and then the JVM is terminated.

          Per default the `QuarantineMonitor` is not started for the `TaskManagers`.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink backportQuarantineMonitor

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/2697.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #2697


          commit 7d89c05aa335046b29f5e263a264dc1d3cdcd890
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-26T22:24:12Z

          FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

          The QuarantineMonitor subscribes to the actor system's event bus and listens to
          AssociationErrorEvents. These are the events which are generated when the actor system
          has quarantined another actor system or if it has been quarantined by another actor
          system. In case of the quarantined state, the actor system will be shutdown killing
          all actors and then the JVM is terminated.

          commit 15c88f636fe2652eb937e0768dd379c058636199
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-26T22:39:17Z

          Add configuration switch to enable the quarantine monitor for TaskManagers

          Per default the QuarantineMonitor is disabled for TaskManagers in order to not change
          the behaviour of 1.1.


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/2697 [backport] FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down This is a back port for release 1.1. The QuarantineMonitor subscribes to the actor system's event bus and listens to AssociationErrorEvents. These are the events which are generated when the actor system has quarantined another actor system or if it has been quarantined by another actor system. In case of the quarantined state, the actor system will be shutdown killing all actors and then the JVM is terminated. Per default the `QuarantineMonitor` is not started for the `TaskManagers`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink backportQuarantineMonitor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2697.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2697 commit 7d89c05aa335046b29f5e263a264dc1d3cdcd890 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-10-26T22:24:12Z FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down The QuarantineMonitor subscribes to the actor system's event bus and listens to AssociationErrorEvents. These are the events which are generated when the actor system has quarantined another actor system or if it has been quarantined by another actor system. In case of the quarantined state, the actor system will be shutdown killing all actors and then the JVM is terminated. commit 15c88f636fe2652eb937e0768dd379c058636199 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-10-26T22:39:17Z Add configuration switch to enable the quarantine monitor for TaskManagers Per default the QuarantineMonitor is disabled for TaskManagers in order to not change the behaviour of 1.1.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2697

          Looks good to me.
          I would suggest to rename `taskmanager.quarantine-monitor.enable` to make it more accessible to users (few will understand what quarantine means). We could call it `taskmanager.exit-on-fatal-akka-error` or so.

          Otherwise +1 to merge this

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2697 Looks good to me. I would suggest to rename `taskmanager.quarantine-monitor.enable` to make it more accessible to users (few will understand what quarantine means). We could call it `taskmanager.exit-on-fatal-akka-error` or so. Otherwise +1 to merge this
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2696

          That's a very good addition, we need something like that.

          After an offline discussion with @tillrohrmann we came to the following conclusion:

          There is a tricky problem with that pure appraoch: When the JobManager fails, all TaskManagers will "quarantine" that JobManager's actor system after they detected the failure. That means they exit and restart. Effectively, a JobManager failure results in all TaskManagers restarting.
          That is a bit heavy.

          Instead, we'll adjust this to do the following:

          • TaskManagers must not watch the JobManager via Akka. That way, JobManager failures do not cause any quarantining on the TaskManager side.
          • The JobManager keeps watching the TaskManagers via Akka, so TaskManager failures (false positives) still result in TaskManager quarantine, which means the TaskManager need to restart when a TM-JM link breaks

          How do TaskManagers then detect JobManager failure?

          • TaskManagers send heartbeats to the JobManager anyways (accumulators, in the future task status reconciliation). The TaskManagers use that to detect JobManager failures.
          • In high availability setups, TaskManagers notice JobManager failure also via ZooKeeper
          • In addition, in flip-6 the resource manager tells TaskManagers about JobManager container failures
          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2696 That's a very good addition, we need something like that. After an offline discussion with @tillrohrmann we came to the following conclusion: There is a tricky problem with that pure appraoch: When the JobManager fails, all TaskManagers will "quarantine" that JobManager's actor system after they detected the failure. That means they exit and restart. Effectively, a JobManager failure results in all TaskManagers restarting. That is a bit heavy. Instead, we'll adjust this to do the following: TaskManagers must not watch the JobManager via Akka. That way, JobManager failures do not cause any quarantining on the TaskManager side. The JobManager keeps watching the TaskManagers via Akka, so TaskManager failures (false positives) still result in TaskManager quarantine, which means the TaskManager need to restart when a TM-JM link breaks How do TaskManagers then detect JobManager failure? TaskManagers send heartbeats to the JobManager anyways (accumulators, in the future task status reconciliation). The TaskManagers use that to detect JobManager failures. In high availability setups, TaskManagers notice JobManager failure also via ZooKeeper In addition, in flip-6 the resource manager tells TaskManagers about JobManager container failures
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/2696

          This means we should do a separate pull request which changes the TaskManager side of the failure detection from Akka watch to the TaskManager heartbeats.

          After that is in, we merge this.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/2696 This means we should do a separate pull request which changes the TaskManager side of the failure detection from Akka watch to the TaskManager heartbeats. After that is in, we merge this.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2697

          Thanks for the review @StephanEwen. I will change the configuration name and then merge the PR into the release-1.1 branch.

          Travis passed locally and the failing test cases are unrelated.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2697 Thanks for the review @StephanEwen. I will change the configuration name and then merge the PR into the release-1.1 branch. Travis passed locally and the failing test cases are unrelated.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/2697

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/2697
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2696

          Thanks for the review @StephanEwen. Yes let's do it as you've proposed. I've opened an [issue](https://issues.apache.org/jira/browse/FLINK-4944) for replacing Akka's death watch with our own heartbeat on the `TaskManager` side.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2696 Thanks for the review @StephanEwen. Yes let's do it as you've proposed. I've opened an [issue] ( https://issues.apache.org/jira/browse/FLINK-4944 ) for replacing Akka's death watch with our own heartbeat on the `TaskManager` side.
          Hide
          rmetzger Robert Metzger added a comment - - edited

          Is FLINK-4944 a fix for 1.3.0 of this issue?

          I guess we won't fix it for 1.0.0

          Show
          rmetzger Robert Metzger added a comment - - edited Is FLINK-4944 a fix for 1.3.0 of this issue? I guess we won't fix it for 1.0.0
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/2696

          Rebased this PR onto the latest master.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/2696 Rebased this PR onto the latest master.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user tillrohrmann opened a pull request:

          https://github.com/apache/flink/pull/3363

          [backport] FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

          This is a backport of #2696 onto the `release-1.2` branch.

          The QuarantineMonitor subscribes to the actor system's event bus and listens to
          AssociationErrorEvents. These are the events which are generated when the actor system
          has quarantined another actor system or if it has been quarantined by another actor
          system. In case of the quarantined state, the actor system will be shutdown killing
          all actors and then the JVM is terminated.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/tillrohrmann/flink quarantineMonitorBackport

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3363.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3363


          commit 0cee579bde9f04a07f36b9c01be3e8089c34b0a4
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2016-10-26T22:24:12Z

          FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

          The QuarantineMonitor subscribes to the actor system's event bus and listens to
          AssociationErrorEvents. These are the events which are generated when the actor system
          has quarantined another actor system or if it has been quarantined by another actor
          system. In case of the quarantined state, the actor system will be shutdown killing
          all actors and then the JVM is terminated.

          commit c52bcfb24ba51120b51d1b62cec44f9a88690e19
          Author: Till Rohrmann <trohrmann@apache.org>
          Date: 2017-02-20T15:37:28Z

          Disable QuarantineMonitor per default; Reintroduce config option for activation


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/3363 [backport] FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down This is a backport of #2696 onto the `release-1.2` branch. The QuarantineMonitor subscribes to the actor system's event bus and listens to AssociationErrorEvents. These are the events which are generated when the actor system has quarantined another actor system or if it has been quarantined by another actor system. In case of the quarantined state, the actor system will be shutdown killing all actors and then the JVM is terminated. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink quarantineMonitorBackport Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3363.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3363 commit 0cee579bde9f04a07f36b9c01be3e8089c34b0a4 Author: Till Rohrmann <trohrmann@apache.org> Date: 2016-10-26T22:24:12Z FLINK-3347 [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down The QuarantineMonitor subscribes to the actor system's event bus and listens to AssociationErrorEvents. These are the events which are generated when the actor system has quarantined another actor system or if it has been quarantined by another actor system. In case of the quarantined state, the actor system will be shutdown killing all actors and then the JVM is terminated. commit c52bcfb24ba51120b51d1b62cec44f9a88690e19 Author: Till Rohrmann <trohrmann@apache.org> Date: 2017-02-20T15:37:28Z Disable QuarantineMonitor per default; Reintroduce config option for activation
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3363

          Rebasing this PR on the latest release-1.2 branch. If Travis passes, then I'll merge the PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3363 Rebasing this PR on the latest release-1.2 branch. If Travis passes, then I'll merge the PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann commented on the issue:

          https://github.com/apache/flink/pull/3363

          Travis passed locally: https://travis-ci.org/tillrohrmann/flink/builds/208150099. Merging this PR.

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann commented on the issue: https://github.com/apache/flink/pull/3363 Travis passed locally: https://travis-ci.org/tillrohrmann/flink/builds/208150099 . Merging this PR.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user tillrohrmann closed the pull request at:

          https://github.com/apache/flink/pull/3363

          Show
          githubbot ASF GitHub Bot added a comment - Github user tillrohrmann closed the pull request at: https://github.com/apache/flink/pull/3363

            People

            • Assignee:
              till.rohrmann Till Rohrmann
              Reporter:
              StephanEwen Stephan Ewen
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Development