Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
Description
Per MESOS-3157:
Our deployment environments have a lot of churn, with many short-lived frameworks that often revive offers. Running the allocator takes a long time (from seconds up to minutes).
In this situation, event-triggered allocation causes the event queue in the allocator process to grow very long, and the allocator effectively becomes unresponsive (e.g., a revive offers message takes too long to reach the head of the queue).
To remedy this, we propose batching the enqueued allocation operations so that a single allocation run can satisfy N enqueued allocation requests. This should reduce the potential for backlog in the allocator. See the discussion in MESOS-3157.
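The batching idea can be illustrated with a small sketch. This is not the Mesos allocator code (which is C++); it is a toy single-threaded event loop, and every name in it (`BatchingAllocator`, `request_allocation`, `allocation_pending`, `candidates`, `log`) is hypothetical. It assumes that an allocation request only needs to record which agent wants allocation, so that one queued allocation run can serve every request that arrived before it executes.

```python
from collections import deque


class BatchingAllocator:
    """Toy event loop that coalesces allocation requests (hypothetical names)."""

    def __init__(self):
        self.queue = deque()             # events, processed in FIFO order
        self.candidates = set()          # agents waiting for an allocation run
        self.allocation_pending = False  # True while an allocation event is queued
        self.log = []                    # record of processed events, for inspection

    def request_allocation(self, agent_id):
        # Always remember which agent asked for allocation.
        self.candidates.add(agent_id)
        # Enqueue an allocation event only if none is already waiting;
        # otherwise the queued event will pick this agent up as well.
        if not self.allocation_pending:
            self.allocation_pending = True
            self.queue.append(self._allocate)

    def revive_offers(self, framework_id):
        # Other events share the same queue as allocation events.
        self.queue.append(lambda: self.log.append(("revive", framework_id)))

    def _allocate(self):
        # One run serves every agent accumulated since the last run.
        batch = sorted(self.candidates)
        self.candidates.clear()
        self.allocation_pending = False
        self.log.append(("allocate", batch))

    def run(self):
        # Drain the queue, executing each event in arrival order.
        while self.queue:
            self.queue.popleft()()


allocator = BatchingAllocator()
for agent in ["a1", "a2", "a3"]:
    allocator.request_allocation(agent)
allocator.revive_offers("f1")
allocator.request_allocation("a1")  # coalesced into the event already queued
queued_events = len(allocator.queue)
allocator.run()
```

In this sketch four allocation requests produce a single queued allocation event, so the revive offers event sits right behind it instead of behind four separate allocation runs. The queue length then depends on the number of other events rather than on how often allocation is triggered.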
Issue Links
- relates to: MESOS-3078 Recovered resources are not re-allocated until the next allocation delay. (Reviewable)
- supersedes: MESOS-3157 Only perform periodic resource allocations. (Resolved)