Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Labels: None
- Sprint: Twitter Mesos Q2 Sprint 5
- Story Points: 3
Description
For large clusters, the 0.23.0 allocator cannot keep up with the volume of slaves. After the slave below re-registered, it took the allocator roughly 45 minutes to work through the backlog of slaves to add:
I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 20150422-211121-2148346890-5050-3253-S4695
I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 20150422-211121-2148346890-5050-3253-S4695
Empirically, addSlave and updateSlave have become expensive.
Timings from a production cluster show the allocator spending in the low tens of milliseconds on each call to addSlave and updateSlave; with tens of thousands of slaves, this adds up to the large delay seen above.
We also saw a slow steady increase in memory consumption, hinting further at a queue backup in the allocator.
A synthetic benchmark, like the one we built for the registrar, would be prudent here, along with visibility into the allocator's queue size.
Attachments
Issue Links
- is blocked by
  - MESOS-2892 Add benchmark for hierarchical allocator. (Resolved)
  - MESOS-2893 Add queue size metrics for the allocator. (Resolved)
- is related to
  - MESOS-2373 DRFSorter needs to distinguish resources from different slaves. (Resolved)