  1. Mesos
  2. MESOS-7882

Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves

    Details

    • Type: Bug
    • Status: Accepted
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: master
    • Environment:

      Ubuntu 14.04 (trusty)
      Mesos master branch.
      SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded

      Description

      We are running Mesos 1.1.0 in production and use a custom autoscaler to scale our Mesos cluster up and down. When scaling the cluster down, the autoscaler makes a POST request to the Mesos master's /maintenance/schedule endpoint with the set of slaves to move into maintenance mode. This causes the Mesos master to rescind all in-flight offers from every slave in the cluster, not just the slaves in the schedule. If our scheduler accepts one of these offers, we get a TASK_LOST status update back for that task. We also see log lines like these (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the Mesos master logs.
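
      A minimal sketch of the kind of request the autoscaler makes, assuming the schedule format from the Mesos maintenance documentation (a list of windows, each with machine_ids and an unavailability interval in nanoseconds). The helper names build_schedule and post_schedule, the hostname, the IP, and the timestamps are hypothetical and are not taken from our autoscaler.

```python
import json
import urllib.request


def build_schedule(machines, start_ns, duration_ns):
    """Build a maintenance schedule with a single window.

    `machines` is a list of (hostname, ip) pairs for the agents to drain.
    """
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": hostname, "ip": ip} for hostname, ip in machines
                ],
                "unavailability": {
                    "start": {"nanoseconds": start_ns},
                    "duration": {"nanoseconds": duration_ns},
                },
            }
        ]
    }


def post_schedule(master, schedule):
    """POST the schedule to the master; `master` is "host:port"."""
    request = urllib.request.Request(
        "http://%s/maintenance/schedule" % master,
        data=json.dumps(schedule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example payload for one agent (hostname, IP, and times are made up).
schedule = build_schedule(
    [("agent1.example.com", "10.0.0.11")],
    start_ns=1500000000 * 10**9,
    duration_ns=3600 * 10**9,
)
```

      Calling post_schedule("10.40.19.239:5050", schedule) would send this payload to the master used in the experiment below; the sketch only builds the payload so that it can be inspected without a running cluster.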

      After reading the code (ref: https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it appears that offers are rescinded for all slaves. I am not sure what the expected behavior is here, but it would make more sense to reclaim resources only from the slaves marked for maintenance.
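
      To make the difference concrete, here is a toy model of the two behaviors. It is not the code in master.cpp; the function names, offer IDs, and hostnames are hypothetical, and offers are modeled as a plain mapping from offer ID to the hostname of the slave that owns the resources.

```python
def offers_to_rescind_all(offers):
    """Observed behavior: every outstanding offer is rescinded."""
    return sorted(offers)


def offers_to_rescind_scheduled(offers, scheduled_hosts):
    """Suggested behavior: rescind only offers from slaves in the schedule."""
    return sorted(
        offer_id for offer_id, host in offers.items() if host in scheduled_hosts
    )


# Offer ID -> hostname of the slave whose resources are offered (made-up data).
offers = {"O1": "slave1", "O2": "slave2", "O3": "slave1"}
scheduled = {"slave1"}

observed = offers_to_rescind_all(offers)
suggested = offers_to_rescind_scheduled(offers, scheduled)
```

      With the suggested behavior, the offer from slave2 stays valid, so a scheduler that accepts it would not get TASK_LOST because of a schedule that only mentions slave1.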

      Experiment:
      To verify that this is actually happening, I checked out the master branch (SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built the binary and started a Mesos master and two agent processes, then used a basic Python framework that launches Docker containers on these slaves. I verified that there was no existing schedule for any slave using `curl 10.40.19.239:5050/maintenance/status`. After starting the Mesos framework, I posted a maintenance schedule for one of the slaves (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0).
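
      The status check can also be scripted. The sketch below assumes the /maintenance/status response is a JSON object with optional "draining_machines" (each entry carrying an "id" with hostname and/or ip) and "down_machines" (each entry a machine ID); that shape is my understanding of the endpoint and should be checked against the real output. The function names and sample data are hypothetical.

```python
import json


def _machine_name(machine_id):
    """Prefer the hostname of a machine ID, falling back to its IP."""
    return machine_id.get("hostname") or machine_id.get("ip")


def machines_in_maintenance(status):
    """Return the machines listed as draining or down in a status response."""
    draining = [_machine_name(m["id"]) for m in status.get("draining_machines", [])]
    down = [_machine_name(m) for m in status.get("down_machines", [])]
    return sorted(draining + down)


# Sample responses (made up): nothing in maintenance, then one draining slave.
empty_status = json.loads("{}")
draining_status = json.loads(
    '{"draining_machines": [{"id": {"hostname": "slave1", "ip": "10.0.0.11"}}]}'
)
```

      An empty result before posting the schedule corresponds to the "no existing schedule" check above; after posting, only the scheduled slave should show up.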

      Logs:
      mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
      mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
      mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
      Mesos framework: https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a

      I think Mesos should rescind offers and inverse offers only for those slaves that are marked for maintenance (draining mode).

              People

              • Assignee:
                kaysoky Joseph Wu
                Reporter:
                sagar8192 Sagar Sadashiv Patwardhan
                Shepherd:
                Benjamin Mahler
              • Votes:
                1
                Watchers:
                9

                Dates

                • Created:
                  Updated: