Mesos / MESOS-7882

Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves

Details

    • Type: Bug
    • Status: Accepted
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: master
    • Environment: Ubuntu 14.04 (Trusty)
      Mesos master branch.
      SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded

    Description

      We are running Mesos 1.1.0 in production with a custom autoscaler that scales the cluster up and down. When scaling down, the autoscaler makes a POST request to the Mesos master's /maintenance/schedule endpoint with the set of slaves to move into maintenance mode. This causes the master to rescind all in-flight offers from every slave in the cluster, not just the slaves in the schedule. If our scheduler accepts one of these offers, we get a TASK_LOST status update back for that task. We also see log lines like these in the Mesos master logs: https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118
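
A minimal sketch of such a request, for illustration only. The payload shape follows the maintenance schedule format described in the Mesos maintenance documentation; the hostname, IP, window length, and the helper names `build_schedule` and `post_schedule` are placeholders and do not come from the autoscaler itself.

```python
import json
import time
import urllib.request


def build_schedule(machines, start_ns, duration_ns):
    """Build a maintenance schedule with one window covering `machines`.

    `machines` is a list of (hostname, ip) tuples (placeholder values).
    """
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": hostname, "ip": ip} for hostname, ip in machines
                ],
                "unavailability": {
                    "start": {"nanoseconds": start_ns},
                    "duration": {"nanoseconds": duration_ns},
                },
            }
        ]
    }


def post_schedule(master, schedule):
    """POST the schedule to the master's /maintenance/schedule endpoint."""
    request = urllib.request.Request(
        "http://%s/maintenance/schedule" % master,
        data=json.dumps(schedule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder agent; replace with the slaves being scaled down.
    schedule = build_schedule(
        [("agent-1.example.com", "10.0.0.1")],
        start_ns=int(time.time() * 1e9),
        duration_ns=3600 * 10**9,
    )
    print(json.dumps(schedule, indent=2))
    # Uncomment to send it to a real master:
    # post_schedule("10.40.19.239:5050", schedule)
```

Note that the schedule names only the machines being drained, which is why rescinding offers from every other agent is surprising.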

      After reading the code (ref: https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it appears that offers are rescinded for all slaves. I am not sure what the expected behavior is here, but it would make more sense to reclaim resources only from the slaves marked for maintenance.

      Experiment:
      To verify that this is actually happening, I checked out the master branch (SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built the binary and started a Mesos master and two agent processes, then ran a basic Python framework that launches Docker containers on these slaves. I verified that there was no existing schedule for any slave using `curl 10.40.19.239:5050/maintenance/status`. After starting the framework, I posted a maintenance schedule for one of the slaves (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0).
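
The "no existing schedule" check can also be scripted. The sketch below assumes the /maintenance/status response is a JSON object with optional `draining_machines` (entries carrying an `id` object) and `down_machines` (machine id objects) keys, as described in the Mesos maintenance documentation; the function names are made up for this example.

```python
import json
import urllib.request


def machines_in_maintenance(status):
    """Return (hostname, ip) pairs listed in a /maintenance/status response.

    Assumes optional "draining_machines" and "down_machines" keys; an
    empty object means no machine is draining or down.
    """
    machines = []
    for entry in status.get("draining_machines", []):
        machine_id = entry.get("id", {})
        machines.append((machine_id.get("hostname"), machine_id.get("ip")))
    for machine_id in status.get("down_machines", []):
        machines.append((machine_id.get("hostname"), machine_id.get("ip")))
    return machines


def fetch_status(master):
    """GET /maintenance/status from the master and decode the JSON body."""
    url = "http://%s/maintenance/status" % master
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `machines_in_maintenance(fetch_status("10.40.19.239:5050"))` should return an empty list before the schedule is posted.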

      Logs:
      mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
      mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
      mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
      Mesos framework: https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a

      I think Mesos should rescind offers and inverse offers only for the slaves that are marked for maintenance (draining mode).
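
The proposed behavior can be illustrated with a toy model. This is not the Mesos master code; `Offer`, `offers_to_rescind`, and the hostnames are hypothetical names used only to show the intended filtering.

```python
from collections import namedtuple

# Toy stand-in for an outstanding offer: its id and the agent it came from.
Offer = namedtuple("Offer", ["offer_id", "agent_hostname"])


def offers_to_rescind(outstanding_offers, scheduled_hostnames):
    """Select only the offers whose agent appears in the new schedule."""
    scheduled = set(scheduled_hostnames)
    return [o for o in outstanding_offers if o.agent_hostname in scheduled]


offers = [Offer("O1", "slave1"), Offer("O2", "slave2")]

# Only slave1 is scheduled for maintenance, so only its offer is rescinded.
print([o.offer_id for o in offers_to_rescind(offers, ["slave1"])])  # ['O1']
```

With this filtering, the offer from slave2 stays valid, so a framework that accepts it would not receive TASK_LOST.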

            People

              Assignee: Joseph Wu (kaysoky)
              Reporter: Sagar Sadashiv Patwardhan (sagar8192)
              Shepherd: Benjamin Mahler
              Votes: 1
              Watchers: 9