Mesos / MESOS-6317

Race in master/allocator when updating oversubscribed resources of an agent.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels: None

    Description

      Currently, when the master handles updateSlave, it first rescinds offers and then calls updateSlave in the allocator. This is racy: a batch allocation can be inserted between the two steps, giving the order rescind offers -> batch allocation -> update slave. That order causes problems when the oversubscribed resources have decreased.

      Suppose the oversubscribed resources decrease from 2 to 1. After the offers are rescinded, the batch allocation allocates the old 2 oversubscribed resources again, and only then does update slave set the total oversubscribed resources to 1. The agent host is then overcommitted for some time, because tasks can still use 2 oversubscribed resources rather than 1. Once the tasks using the 2 oversubscribed resources finish, the agent returns to normal.

      So we should swap the order of rescinding offers and updateSlave in the master to avoid resource overcommit.

      If we update the slave first and then rescind offers, the order becomes update slave -> batch allocation -> rescind offers, which causes no problem when resources decrease. Suppose again that the oversubscribed resources decrease from 2 to 1. Update slave sets the total oversubscribed resources to 1 directly; the batch allocation then allocates no oversubscribed resources, since more are already allocated than the new total; finally, rescind offers rescinds all offers that use oversubscribed resources. This does not leave the agent host overcommitted.
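      The two orderings can be compared with a small toy model. This is only an illustrative sketch, not Mesos code: `ToyAllocator`, `Order`, and `offeredAfterInterleaving` are hypothetical names, resources are plain integers, and the model assumes that a batch allocation offers exactly the unallocated part of the total (never a negative amount) and that every oversubscribed resource is on offer before the update arrives.

```cpp
#include <algorithm>
#include <cassert>

// Toy model (hypothetical, not the Mesos allocator) of one agent's
// oversubscribed resources as tracked by an allocator.
struct ToyAllocator {
  int total;    // oversubscribed total currently known to the allocator
  int offered;  // oversubscribed resources sitting in outstanding offers

  // The master rescinds outstanding offers; resources return to the pool.
  void rescindOffers() { offered = 0; }

  // A batch allocation offers whatever is not yet allocated.
  // If more is allocated than the total, nothing new is offered.
  void batchAllocate() { offered += std::max(0, total - offered); }

  // The new oversubscribed estimate reaches the allocator.
  void updateSlave(int newTotal) { total = newTotal; }
};

enum class Order { RescindFirst, UpdateFirst };

// Runs the three steps with a batch allocation squeezed in between and
// returns how many oversubscribed resources are on offer afterwards.
int offeredAfterInterleaving(Order order, int oldTotal, int newTotal) {
  assert(oldTotal >= 0 && newTotal >= 0);

  ToyAllocator allocator{oldTotal, oldTotal};  // everything is on offer

  if (order == Order::RescindFirst) {
    // Ordering described as racy in this issue.
    allocator.rescindOffers();
    allocator.batchAllocate();   // re-offers the stale total
    allocator.updateSlave(newTotal);
  } else {
    // Ordering proposed in this issue.
    allocator.updateSlave(newTotal);
    allocator.batchAllocate();   // offers nothing: allocated > total
    allocator.rescindOffers();
  }

  return allocator.offered;
}
```

      Under these assumptions, `offeredAfterInterleaving(Order::RescindFirst, 2, 1)` leaves 2 resources on offer although the new total is 1, so frameworks can still launch tasks that overcommit the agent. `offeredAfterInterleaving(Order::UpdateFirst, 2, 1)` leaves 0 on offer, and any later batch allocation can hand out at most the new total of 1.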

          People

            Assignee: Guangya Liu (gyliu)
            Reporter: Guangya Liu (gyliu)
            Shepherd: Benjamin Mahler
            Votes: 0
            Watchers: 2
