  1. Mesos
  2. MESOS-8524

When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: allocation, master
    • Environment: Master + Agent running with enabled RESOURCE_PROVIDER capability

    Description

      When an agent with the RESOURCE_PROVIDER capability enabled (re-)registers with the master, it sends an UPDATE_SLAVE message after being (re-)registered. In the master, the agent is added (back) to the allocator as soon as it is (re-)registered, i.e. before UPDATE_SLAVE is sent. This triggers an allocation, and offers might get sent out to frameworks. When UPDATE_SLAVE is handled in the master, these offers have to be rescinded, as they are based on an outdated agent state.
      Internally, the allocator defers an offer callback in the master (Master::offer). In rare cases, an UPDATE_SLAVE message might arrive at the same time, and its handler in the master is called before the offer callback (but after the actual allocation took place). In this case, the (outdated) offer is still sent to frameworks and never rescinded.

      Here are the relevant log lines. This was discovered while working on https://reviews.apache.org/r/65045/:

      I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation for 1 agents in 704915ns
      I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 (172.18.8.20) with total oversubscribed resources {}
      I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to framework 53c557e7-3161-449b-bacc-a4f8c02e78e7-0000 (default) at scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
      I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 40444ns
      I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } (used)
      I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
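
The ordering in the log (allocation performed, then the agent update received, then the offer sent) can be reproduced with a toy model. The following is only a minimal sketch and not Mesos code: `ToyMaster`, `allocate`, `updateSlave`, `run` and `staleOffers` are hypothetical names, the master is reduced to a FIFO queue of handlers, and the agent state is reduced to a version counter. It assumes that the UPDATE_SLAVE handler can only rescind offers the master already knows about.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the master; handlers run one at a time in FIFO order.
struct ToyMaster {
  std::deque<std::function<void()>> mailbox;
  std::vector<int> outstanding;  // agent-state version each outstanding offer is based on
  int agentVersion = 0;          // bumped whenever the agent's resources are updated

  // Allocator side: the allocation happens now (the agent state is snapshotted),
  // but the offer itself is only delivered once the returned callback runs.
  std::function<void()> allocate() {
    const int snapshot = agentVersion;
    return [this, snapshot] { outstanding.push_back(snapshot); };
  }

  // UPDATE_SLAVE handler: rescind every offer known at this point,
  // then apply the new agent state.
  void updateSlave() {
    outstanding.clear();
    ++agentVersion;
  }

  // Drain the mailbox in arrival order.
  void run() {
    while (!mailbox.empty()) {
      std::function<void()> handler = std::move(mailbox.front());
      mailbox.pop_front();
      handler();
    }
  }
};

// Returns how many offers remain outstanding although they are based on an
// outdated agent state.
int staleOffers(bool updateHandledFirst) {
  ToyMaster master;
  std::function<void()> offerCallback = master.allocate();  // allocation took place
  std::function<void()> update = [&master] { master.updateSlave(); };

  if (updateHandledFirst) {
    // The race: UPDATE_SLAVE is handled before the deferred offer callback.
    master.mailbox.push_back(update);
    master.mailbox.push_back(offerCallback);
  } else {
    master.mailbox.push_back(offerCallback);
    master.mailbox.push_back(update);
  }
  master.run();

  int stale = 0;
  for (int version : master.outstanding) {
    if (version != master.agentVersion) ++stale;
  }
  return stale;
}
```

In this model, `staleOffers(false)` returns 0 because the update handler sees and rescinds the offer, while `staleOffers(true)` returns 1: the handler finds nothing to rescind, and the offer based on the old snapshot is delivered afterwards.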
      



          People

            kaysoky Joseph Wu
            nfnt Jan Schlicht
            Benjamin Mahler Benjamin Mahler

            Dates

              Created:
              Updated:

              Agile

                Completed Sprints:
                Mesosphere Sprint 74 ended 15/Feb/18
                Mesosphere Sprint 75 ended 03/Mar/18
                Mesosphere Sprint 76 ended 30/Mar/18
