Mesos / MESOS-6676

Always re-link with scheduler during re-registration.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.3, 1.1.1, 1.2.0
    • Component/s: master

    Description

      Scenario:

      1. Framework registers with master using a non-zero failover_timeout and is assigned a FrameworkID.
      2. The master sees an ExitedEvent for the master->scheduler link. This could happen due to some transient network error, e.g., 1-way partition. The master sends a FrameworkErrorMessage to the framework. The master marks the framework as disconnected, but keeps the Framework* for it around in frameworks.registered.
      3. The framework doesn't receive the FrameworkErrorMessage because it is dropped by the network.
      4. The scheduler might receive an ExitedEvent for the scheduler -> master link, but it ignores this anyway (see MESOS-887).
      5. The scheduler sees a new-master-detected event and re-registers with the master. It does not set the force flag, so the master follows a re-registration code path that does not re-link with the scheduler.

      The result is that scheduler re-registration succeeds, but the master -> scheduler link is never re-established.
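
The scenario above can be sketched as a toy state model. This is only an illustration, not Mesos code: the `ToyMaster` class, its `relink_on_reregister` switch, and the `connected` / `linked` fields are hypothetical names invented for this sketch, and it models only the non-forced re-registration path described in step 5.

```python
class ToyMaster:
    """Toy model of the master's view of one framework (hypothetical, not Mesos code)."""

    def __init__(self, relink_on_reregister):
        # Hypothetical switch: False models the behaviour reported in this
        # issue, True models "always re-link during re-registration".
        self.relink_on_reregister = relink_on_reregister
        self.registered = {}  # framework_id -> {"connected": bool, "linked": bool}

    def register(self, framework_id):
        # Step 1: initial registration establishes the master -> scheduler link.
        self.registered[framework_id] = {"connected": True, "linked": True}

    def exited(self, framework_id):
        # Step 2: the link breaks; the framework is marked disconnected
        # but its entry is kept around.
        fw = self.registered[framework_id]
        fw["connected"] = False
        fw["linked"] = False

    def reregister(self, framework_id):
        # Step 5: re-registration without the force flag.
        fw = self.registered[framework_id]
        fw["connected"] = True
        if self.relink_on_reregister:
            fw["linked"] = True


def simulate(relink_on_reregister):
    """Run steps 1, 2 and 5 and return (connected, linked) for the framework."""
    master = ToyMaster(relink_on_reregister)
    master.register("framework-1")
    master.exited("framework-1")
    master.reregister("framework-1")
    fw = master.registered["framework-1"]
    return fw["connected"], fw["linked"]


if __name__ == "__main__":
    print("reported behaviour:", simulate(False))
    print("always re-link:    ", simulate(True))
```

With `relink_on_reregister=False` the framework ends up connected but without a link, which is the state described above. With `True` the link is restored as part of re-registration, which is what the issue title asks for.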


          People

            Assignee: Neil Conway (neilc)
            Reporter: Neil Conway (neilc)
            Shepherd: Vinod Kone
            Votes: 0
            Watchers: 3
