Mesos / MESOS-8185

Tasks can be known to the agent but unknown to the master.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
    • Sprint: Mesosphere Sprint 68

      Description

      Currently, when a master re-registers an agent that was marked unreachable, it shuts down all non-partition-aware frameworks on that agent. When a master re-registers an agent that is already registered, it doesn't check that all tasks from the agent's re-registration message are known to it.

      It is possible that, due to a transient loss of connectivity, an agent misses the SlaveReregisteredMessage along with the ShutdownFrameworkMessage and thus does not kill the tasks of non-partition-aware frameworks. The master nevertheless marks the agent as registered and does not re-add the tasks that it expected to be killed. The agent may then re-register again, this time successfully, before being marked unreachable, without ever having terminated the tasks of non-partition-aware frameworks. The master simply forgets those tasks ever existed, because it "removed" them during the previous re-registration.

      Example scenario:

      1. Connection from the master to the agent stops working.
      2. Agent doesn't see pings from the master and attempts to re-register.
      3. Master sends SlaveReregisteredMessage and ShutdownFrameworkMessage, which don't reach the agent because of the connection failure. Agent is marked registered.
      4. Network issue resolves, connection breaks. Agent retries re-registration.
      5. Master thinks that the agent has been registered since step (3) and just re-sends SlaveReregisteredMessage. Tasks remain running on the agent.
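
      The scenario can be reproduced with a toy model. This is only an illustrative sketch, not Mesos code: ToyAgent, ToyMaster, reregister and the messages_delivered flag are hypothetical names, and the model assumes the master had already marked the agent unreachable before step (2).

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    # Task ID -> whether the task's framework is partition-aware.
    tasks: dict = field(default_factory=dict)


@dataclass
class ToyMaster:
    # Tasks the master believes are running on the agent.
    known_tasks: set = field(default_factory=set)
    # False models "agent marked unreachable" (an assumption of this sketch).
    agent_registered: bool = False


def reregister(master, agent, messages_delivered):
    """Toy re-registration; returns the names of the messages the master sends."""
    if master.agent_registered:
        # Already registered: the master only re-sends the acknowledgement
        # and never compares the agent's tasks with its own view.
        return ["SlaveReregisteredMessage"]

    # Agent was unreachable: keep partition-aware tasks, shut down the rest.
    doomed = {t for t, aware in agent.tasks.items() if not aware}
    master.known_tasks = set(agent.tasks) - doomed
    master.agent_registered = True
    if messages_delivered:
        # Only a delivered shutdown actually removes tasks from the agent.
        for task_id in doomed:
            del agent.tasks[task_id]
    sent = ["SlaveReregisteredMessage"]
    if doomed:
        sent.append("ShutdownFrameworkMessage")
    return sent


agent = ToyAgent(tasks={"aware-task": True, "legacy-task": False})
master = ToyMaster()

reregister(master, agent, messages_delivered=False)  # step 3: messages are lost
reregister(master, agent, messages_delivered=True)   # step 5: retry succeeds

# Tasks still running on the agent that the master no longer tracks.
orphaned = set(agent.tasks) - master.known_tasks
print(sorted(orphaned))  # prints ['legacy-task']
```

      In this model the second call takes the "already registered" branch, so the task of the non-partition-aware framework keeps running on the agent while the master has no record of it.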

      One possible solution would be to compare the list of tasks that the already registered agent reports in ReregisterSlaveMessage with the list of tasks the master has. Anything that the master doesn't know about should not exist on the agent.
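
      A minimal sketch of that comparison, assuming tasks are identified by plain string IDs. The function name find_unknown_tasks and the sample IDs are hypothetical, and what the master then does with the result (for example, asking the agent to kill those tasks) is left open here.

```python
def find_unknown_tasks(reported_task_ids, master_task_ids):
    """Return the task IDs the agent reports that the master does not know."""
    return sorted(set(reported_task_ids) - set(master_task_ids))


# Tasks listed by the agent in its re-registration message (sample data).
reported = ["aware-task", "legacy-task"]
# Tasks the master currently tracks for that agent (sample data).
known = ["aware-task"]

unknown = find_unknown_tasks(reported, known)
print(unknown)  # prints ['legacy-task']
```

      A set difference is enough for the check itself; an empty result means the agent's view and the master's view agree.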

            People

            • Assignee: ipronin Ilya
            • Reporter: ipronin Ilya
            • Shepherd: Benjamin Mahler
