Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
None
Description
Scenario:
- Framework registers with master using a non-zero failover_timeout and is assigned a FrameworkID.
- The master sees an ExitedEvent for the master -> scheduler link. This could happen due to a transient network error, e.g., a one-way partition. The master sends a FrameworkErrorMessage to the framework. The master marks the framework as disconnected, but keeps the Framework* for it around in frameworks.registered.
- The framework doesn't receive the FrameworkErrorMessage because it is dropped by the network.
- The scheduler might receive an ExitedEvent for the scheduler -> master link, but it ignores this anyway (see MESOS-887).
- The scheduler sees a new-master-detected event and re-registers with the master. It does not set the force flag, so the master follows a code path that does not relink with the scheduler.
The result is that scheduler re-registration succeeds, but the master -> scheduler link is never re-established.
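The sequence above can be sketched as a small state model. This is a toy illustration, not Mesos code: `ToyMaster`, `Framework`, the `linked` and `connected` fields, and the `relink_on_reregister` switch are all hypothetical names introduced here. It also assumes that a forced re-registration relinks, which the report implies but does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    framework_id: str
    failover_timeout: float
    connected: bool = True  # master's view: is the framework connected?
    linked: bool = True     # is the master -> scheduler link established?


@dataclass
class ToyMaster:
    # Hypothetical stand-in for the master's table of registered frameworks.
    registered: dict = field(default_factory=dict)
    # False models the behavior reported here; True models a possible fix.
    relink_on_reregister: bool = False

    def register(self, framework_id: str, failover_timeout: float) -> Framework:
        fw = Framework(framework_id, failover_timeout)
        self.registered[framework_id] = fw
        return fw

    def exited(self, framework_id: str) -> None:
        # The link broke: mark the framework disconnected, but keep its
        # entry around because failover_timeout is non-zero.
        fw = self.registered[framework_id]
        fw.connected = False
        fw.linked = False

    def reregister(self, framework_id: str, force: bool = False) -> Framework:
        fw = self.registered[framework_id]
        fw.connected = True
        # Assumption: only the forced path (or the hypothetical fix) relinks.
        if force or self.relink_on_reregister:
            fw.linked = True
        return fw


master = ToyMaster()
master.register("fw-1", failover_timeout=60.0)
master.exited("fw-1")                       # transient network error
fw = master.reregister("fw-1", force=False)
print(fw.connected, fw.linked)              # re-registered, but link still down
```

In this model, re-registration flips `connected` back to true while `linked` stays false, which mirrors the reported end state. Setting `relink_on_reregister=True` shows what relinking on every re-registration would look like in the same model.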
Attachments
Issue Links
- is related to
  - MESOS-5180 Scheduler driver does not detect disconnection with master and reregister. (Accepted)
  - MESOS-887 Scheduler driver should use exited() to detect disconnection with Master. (Open)