  Mesos / MESOS-2891

Performance regression in hierarchical allocator.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: allocation, master
    • Sprint: Twitter Mesos Q2 Sprint 5
    • Story Points: 3

      Description

      For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:

      45 minute delay
      I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
      I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
      

      Empirically, addSlave and updateSlave have become expensive.

      Timings from a production cluster show that the allocator spends in the low tens of milliseconds on each call to addSlave and updateSlave. With tens of thousands of slaves, these calls add up to the large delay seen above.
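
The arithmetic behind this can be sketched as follows. The two timestamps are taken from the log lines above; the slave count and per-call cost are illustrative assumptions chosen from the stated ranges ("tens of thousands", "low tens of milliseconds"), not measured values, and `backlog_drain_seconds` is a hypothetical helper, not part of Mesos.

```python
from datetime import datetime

# Timestamps copied from the two log lines above (same day, glog format).
FMT = "%H:%M:%S.%f"
re_registered = datetime.strptime("18:55:40.738399", FMT)
added = datetime.strptime("19:40:14.960636", FMT)

# Observed gap between master re-registration and the allocator's addSlave.
observed_delay_s = (added - re_registered).total_seconds()


def backlog_drain_seconds(num_slaves, per_call_ms, calls_per_slave=1):
    """Time to serially process one backlog of allocator calls, in seconds."""
    return num_slaves * calls_per_slave * per_call_ms / 1000.0


# ASSUMPTION: illustrative values only, not taken from the production cluster.
assumed_slaves = 30_000
assumed_per_call_ms = 30

estimate_s = backlog_drain_seconds(assumed_slaves, assumed_per_call_ms)

print(f"observed delay: {observed_delay_s / 60:.1f} min")
print(f"estimated drain time for one call per slave: {estimate_s / 60:.1f} min")
```

Under these assumed numbers a single pass of one call per slave already costs many minutes, so a backlog containing both addSlave and updateSlave calls plausibly reaches the same order of magnitude as the observed delay.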

      We also saw a slow steady increase in memory consumption, hinting further at a queue backup in the allocator.

      A synthetic benchmark, like the one we did for the registrar, would be prudent here, along with visibility into the allocator's queue size.
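
To illustrate why queue size is a useful signal, here is a toy model of a single-consumer event queue. It is only a sketch under stated assumptions: it does not use or resemble the Mesos allocator code, `simulate_backlog` is a hypothetical name, and the arrival and service times are made up. It shows that when each call takes longer than the interval between arrivals, queue depth and per-event latency grow steadily.

```python
from collections import deque


def simulate_backlog(num_events, arrival_interval_ms, service_ms):
    """Toy model: events arrive at a fixed interval and are handled serially.

    Returns (max_queue_depth, latency_of_last_event_ms). Depth counts the
    event currently being processed as well as those waiting behind it.
    """
    pending_finish_times = deque()  # finish times of events not yet done
    worker_free_at = 0
    max_depth = 0
    last_latency = 0

    for i in range(num_events):
        arrival = i * arrival_interval_ms

        # Drop events that completed before this one arrived.
        while pending_finish_times and pending_finish_times[0] <= arrival:
            pending_finish_times.popleft()

        # The single worker starts this event once it is free.
        start = max(arrival, worker_free_at)
        worker_free_at = start + service_ms
        pending_finish_times.append(worker_free_at)

        max_depth = max(max_depth, len(pending_finish_times))
        last_latency = worker_free_at - arrival

    return max_depth, last_latency


# ASSUMPTION: made-up numbers; events arrive every 10 ms but cost 20 ms each.
depth, latency = simulate_backlog(1000, arrival_interval_ms=10, service_ms=20)
print(f"max queue depth: {depth}, latency of last event: {latency} ms")
```

A real benchmark would drive the actual allocator with a synthetic set of slaves and frameworks; the model above only motivates exposing queue depth as a metric, since a growing depth is an early indicator of the backlog.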

        Attachments

        1. perf-kernel.svg
          412 kB
          Benjamin Mahler
        2. Screen Shot 2015-06-18 at 5.02.26 PM.png
          484 kB
          Jie Yu

              People

              • Assignee: Jie Yu (jieyu)
              • Reporter: Benjamin Mahler (bmahler)
              • Shepherd: Benjamin Mahler
