  1. Mesos
  2. MESOS-6483

Check failure when a 1.1 master marks a 0.28 agent as unreachable



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 1.1.0, 1.2.0
    • None
    • None


      When upgrading directly from Mesos 0.28 to a version later than 1.0, the CHECK(frameworks.recovered.contains(frameworkId)) in Master::_markUnreachable(..) can fail. The following sequence of events triggers it:

      1) The master is upgraded to the new version first, while an agent, say X, is still running Mesos 0.28.
      2) Agent X (at Mesos 0.28) attempts to re-register with the master (at, say, 1.1). It does not send the frameworks (frameworkInfos) in the ReRegisterSlave message, because that field was not available in the older Mesos version.
      3) Among the frameworks on agent X is a framework Y that did not re-register after the master's failover. Because the master builds frameworks.recovered from the frameworkInfos that agents provide, framework Y is in neither the recovered nor the registered frameworks.
      4) After re-registering, agent X fails the master's health check and is marked unreachable by the master. The CHECK(frameworks.recovered.contains(frameworkId)) then fails for framework Y, since it is neither recovered nor registered but still has tasks running on agent X.
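
      The sequence above can be sketched with a small, self-contained C++ model. This is not the Mesos master code: the Agent and Master structs, reregisterAgent, failingChecks, and runScenario are hypothetical stand-ins, and the only behavior assumed is what the steps describe (frameworks.recovered is populated solely from the frameworkInfos an agent sends, and the CHECK requires every non-registered framework with tasks on the agent to be in frameworks.recovered).

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-ins; not the real Mesos types.
struct Agent {
  bool sendsFrameworkInfos;                 // false models a 0.28 agent
  std::vector<std::string> taskFrameworks;  // framework ID of each running task
};

struct Master {
  std::set<std::string> registered;  // frameworks that re-registered after failover
  std::set<std::string> recovered;   // frameworks learned from agents' frameworkInfos

  // Step 2: `recovered` is only filled in when the agent reports frameworkInfos.
  void reregisterAgent(const Agent& agent) {
    if (!agent.sendsFrameworkInfos) {
      return;
    }
    for (const std::string& frameworkId : agent.taskFrameworks) {
      if (registered.count(frameworkId) == 0) {
        recovered.insert(frameworkId);
      }
    }
  }

  // Step 4: collect the frameworks for which the equivalent of
  // CHECK(frameworks.recovered.contains(frameworkId)) would fail.
  std::set<std::string> failingChecks(const Agent& agent) const {
    std::set<std::string> failing;
    for (const std::string& frameworkId : agent.taskFrameworks) {
      if (registered.count(frameworkId) > 0) {
        continue;  // Registered frameworks are handled normally.
      }
      if (recovered.count(frameworkId) == 0) {
        failing.insert(frameworkId);
      }
    }
    return failing;
  }
};

// Agent X runs tasks for framework Y (never re-registered) and Z (registered).
std::set<std::string> runScenario(bool sendsFrameworkInfos) {
  Master master;
  master.registered = {"Z"};
  assert(master.registered.count("Y") == 0);  // Step 3: Y did not re-register.

  Agent x{sendsFrameworkInfos, {"Y", "Z"}};
  master.reregisterAgent(x);
  return master.failingChecks(x);
}
```

      In this model, runScenario(false) (the 0.28 agent) reports framework Y as tripping the check, while runScenario(true) (an agent that sends frameworkInfos) reports nothing.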





              neilc Neil Conway
              megha.sharma Megha Sharma
              Vinod Kone Vinod Kone