Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Labels: None
- Sprint: Twitter Mesos Q2 Sprint 5
- Story Points: 3
Description
For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the slave below re-registered, it took the allocator roughly 45 minutes to work through the backlog of slaves to add:
I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
Empirically, addSlave and updateSlave have become expensive.
Timings from a production cluster show the allocator spending in the low tens of milliseconds on each call to addSlave and updateSlave; with tens of thousands of slaves, this adds up to the large delay seen above.
We also saw a slow steady increase in memory consumption, hinting further at a queue backup in the allocator.
A synthetic benchmark, like the one we built for the registrar, would be prudent here, along with visibility into the allocator's queue size.
Attachments
Issue Links
- is blocked by
  - MESOS-2892 Add benchmark for hierarchical allocator. (Resolved)
  - MESOS-2893 Add queue size metrics for the allocator. (Resolved)
- is related to
  - MESOS-2373 DRFSorter needs to distinguish resources from different slaves. (Resolved)