  1. Mesos
  2. MESOS-2891

Performance regression in hierarchical allocator.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: allocation, master
    • Sprint: Twitter Mesos Q2 Sprint 5
    • Story Points: 3

    Description

      For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the following slave was re-registered, it took the allocator a long time to work through the backlog of slaves to add:

      45 minute delay
      I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
      I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
      

      Empirically, addSlave and updateSlave have become expensive.

      Timings from a production cluster show that the allocator spends in the low tens of milliseconds on each call to addSlave and updateSlave. With tens of thousands of slaves, this adds up to the large delay seen above.
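
      To make the arithmetic concrete, here is a minimal C++ sketch. It assumes the allocator drains its queue one call at a time, so the total delay is simply the number of queued calls times the per-call cost. The helper names secondsSinceMidnight and drainSeconds are hypothetical and are not part of Mesos.

```cpp
#include <cassert>
#include <string>

// Parses the "HH:MM:SS.ffffff" portion of a glog-style timestamp into
// seconds since midnight. Assumes a well-formed input string.
double secondsSinceMidnight(const std::string& hms)
{
  assert(hms.size() >= 8 && hms[2] == ':' && hms[5] == ':');
  const int hours = std::stoi(hms.substr(0, 2));
  const int minutes = std::stoi(hms.substr(3, 2));
  const double seconds = std::stod(hms.substr(6));
  return hours * 3600.0 + minutes * 60.0 + seconds;
}

// Time, in seconds, for a single-threaded queue to drain `calls`
// events that each cost `millisPerCall` milliseconds (assumed model).
double drainSeconds(long calls, double millisPerCall)
{
  return calls * millisPerCall / 1000.0;
}
```

      Applied to the two log timestamps above (18:55:40.738399 and 19:40:14.960636), the difference is about 2674 seconds, or roughly 44.6 minutes. As an illustration only, 50,000 queued calls at 30 ms each would take 1500 seconds (25 minutes) to drain. Both figures are picked from the "tens of thousands" and "low tens of milliseconds" ranges and are not measured values.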

      We also saw a slow steady increase in memory consumption, hinting further at a queue backup in the allocator.

      A synthetic benchmark, like the one we did for the registrar, would be prudent here, along with visibility into the allocator's queue size.
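
      As a rough idea of what such a benchmark could measure, the following C++ sketch queues one add event per slave, drains the queue, and reports the peak backlog and total drain time. It is a toy model, not the Mesos allocator: BenchmarkResult and runSyntheticAddSlave are hypothetical names, and the O(n) work per event is an assumption chosen to mimic a per-call cost that grows with cluster size, not a diagnosis of the actual regression.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <vector>

// Result of one synthetic run.
struct BenchmarkResult
{
  std::size_t processed;      // Events handled.
  std::size_t maxQueueDepth;  // Largest backlog observed.
  double totalSeconds;        // Wall-clock time to drain the queue.
};

// Hypothetical stand-in for an allocator actor; NOT the Mesos API.
BenchmarkResult runSyntheticAddSlave(std::size_t slaveCount)
{
  std::deque<std::size_t> queue;  // Pending slave ids.
  std::vector<double> slaves;     // "Registered" slaves (CPU totals).
  std::size_t maxDepth = 0;

  // Simulate every slave re-registering before any event is handled.
  for (std::size_t id = 0; id < slaveCount; ++id) {
    queue.push_back(id);
    maxDepth = std::max(maxDepth, queue.size());
  }

  const auto start = std::chrono::steady_clock::now();
  double checksum = 0.0;
  while (!queue.empty()) {
    queue.pop_front();
    slaves.push_back(1.0);
    // Assumed hot spot: O(n) recomputation over all known slaves.
    for (double cpus : slaves) {
      checksum += cpus;
    }
  }
  const std::chrono::duration<double> elapsed =
    std::chrono::steady_clock::now() - start;

  assert(checksum > 0.0 || slaveCount == 0);
  return {slaves.size(), maxDepth, elapsed.count()};
}
```

      Running this for increasing values of slaveCount and comparing totalSeconds would show how drain time scales with cluster size. Exporting the queue depth as a metric is what would give the visibility mentioned above.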

      Attachments

        1. Screen Shot 2015-06-18 at 5.02.26 PM.png
          484 kB
          Jie Yu
        2. perf-kernel.svg
          412 kB
          Benjamin Mahler

            People

              Assignee: Jie Yu (jieyu)
              Reporter: Benjamin Mahler (bmahler)
              Shepherd: Benjamin Mahler
              Votes: 0
              Watchers: 3
