Flink / FLINK-3347

TaskManager (or its ActorSystem) needs to restart when it notices a quarantine


Details

    Description

      There are cases where Akka quarantines remote actor systems. In that case, no further communication is possible with that actor system unless one of the two actor systems is restarted.

      The result is that a TaskManager is up and available, but cannot register at the JobManager (Akka refuses connection because of the quarantined state), making the TaskManager a useless process.

      I suggest letting the TaskManager restart itself once it notices that either it has quarantined the JobManager or the JobManager has quarantined it.

      The quarantined state can be recognized by listening to certain events in the actor system event stream: http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
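
      A minimal sketch of the proposed behaviour, using only the Java standard library. It is not Akka's API and not an actual Flink implementation: `RemotingEvent`, `EventStream`, `QuarantineMonitor`, the event labels "Quarantined" and "AssociationError", the marker text "quarantined this system", and the JobManager address are all hypothetical stand-ins chosen for illustration. In a real setup the monitor would presumably subscribe to the remoting lifecycle events that Akka publishes on the actor system event stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class QuarantineWatch {

    /** Hypothetical stand-in for a remoting lifecycle event on the event stream. */
    public static final class RemotingEvent {
        final String kind;          // illustrative labels: "Quarantined", "AssociationError"
        final String remoteAddress; // address of the other actor system
        final String message;       // error text, empty if none

        public RemotingEvent(String kind, String remoteAddress, String message) {
            this.kind = kind;
            this.remoteAddress = remoteAddress;
            this.message = message == null ? "" : message;
        }
    }

    /** Minimal publish/subscribe bus that models the event stream (not Akka's API). */
    public static final class EventStream {
        private final List<Consumer<RemotingEvent>> subscribers = new ArrayList<>();

        public void subscribe(Consumer<RemotingEvent> subscriber) {
            subscribers.add(subscriber);
        }

        public void publish(RemotingEvent event) {
            for (Consumer<RemotingEvent> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    /** Triggers a restart once when a quarantine involving the JobManager is observed. */
    public static final class QuarantineMonitor implements Consumer<RemotingEvent> {
        // Assumed marker text for "the remote side quarantined us"; verify against real logs.
        static final String QUARANTINED_BY_REMOTE = "quarantined this system";

        private final String jobManagerAddress;
        private final Runnable restartAction;
        private boolean restartRequested = false;

        public QuarantineMonitor(String jobManagerAddress, Runnable restartAction) {
            this.jobManagerAddress = jobManagerAddress;
            this.restartAction = restartAction;
        }

        @Override
        public void accept(RemotingEvent event) {
            // Ignore events about other systems, and never restart twice.
            if (restartRequested || !jobManagerAddress.equals(event.remoteAddress)) {
                return;
            }
            // Case 1: this TaskManager quarantined the JobManager.
            boolean weQuarantinedJobManager = "Quarantined".equals(event.kind);
            // Case 2: the JobManager quarantined this TaskManager.
            boolean jobManagerQuarantinedUs = "AssociationError".equals(event.kind)
                    && event.message.contains(QUARANTINED_BY_REMOTE);
            if (weQuarantinedJobManager || jobManagerQuarantinedUs) {
                restartRequested = true;
                restartAction.run();
            }
        }
    }

    public static void main(String[] args) {
        String jobManager = "akka.tcp://flink@jobmanager:6123"; // hypothetical address
        EventStream stream = new EventStream();
        stream.subscribe(new QuarantineMonitor(jobManager,
                () -> System.out.println("Restarting TaskManager actor system")));
        stream.publish(new RemotingEvent("AssociationError", jobManager,
                "The remote system has quarantined this system."));
    }
}
```

      The monitor handles both directions named above and guards against repeated restarts, since several quarantine-related events may arrive for the same failure.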


          People

            trohrmann Till Rohrmann
            sewen Stephan Ewen
            Votes: 0
            Watchers: 5

