Details
- Type: Task
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- 1.0.1
- None
- None
Description
As I understand it, the master sends an offer to each framework in a list ordered by DRF until the offer is accepted, waiting 1s between successive offers. Once the decline timeout for the first framework has expired, the master does not continue offering to the rest of the frameworks in the list; it starts over at the beginning, starving the remaining frameworks.
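This means that for Mesos to support more than 5 concurrent frameworks, every framework must be a good citizen and either set its decline timeout to something large or suppress offers. (The figure of 5 follows from the 1s wait if the decline timeout is left at its 5s default: only the first 5 frameworks see the offer before the first framework becomes eligible again.) I think this is a fairly undesirable state of affairs.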
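To make the starvation concrete, here is a toy simulation of the behavior as I describe it above. It is not Mesos allocator code: the function name `simulate_offer_cycle` and its parameters are made up for illustration, and it assumes a single offer, a fixed DRF order, and frameworks that always decline.

```python
def simulate_offer_cycle(num_frameworks, refuse_seconds=5, interval=1, ticks=60):
    """Count how often each framework is offered the resource.

    Toy model (assumption, not the real allocator): every `interval`
    seconds the master walks the list from the start and offers to the
    first framework whose decline timeout has expired. Every framework
    declines, which re-arms its timeout for `refuse_seconds`.
    """
    filter_expiry = [0] * num_frameworks  # time at which each decline timeout lapses
    offers_seen = [0] * num_frameworks

    for t in range(0, ticks * interval, interval):
        # Always restart at the head of the DRF-ordered list.
        for fw in range(num_frameworks):
            if filter_expiry[fw] <= t:
                offers_seen[fw] += 1
                filter_expiry[fw] = t + refuse_seconds  # framework declines
                break

    return offers_seen


if __name__ == "__main__":
    # With 8 frameworks and a 5s timeout, the last 3 never see the offer.
    print(simulate_offer_cycle(8))
    # With a 10s timeout, all 8 frameworks are reached.
    print(simulate_offer_cycle(8, refuse_seconds=10))
```

In this model the number of frameworks that ever see the offer is capped at roughly `refuse_seconds / interval`, which is why a larger decline timeout (or suppressing offers) is currently the only way out.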
I propose that the master instead continue to submit the offer to every registered framework in the list, even after the declineOffer timeout of an earlier framework has been reached.
This change could increase task startup latency; that can be mitigated in part if we also make the master smarter about how long to wait between successive offers, rather than using a static 1s wait.
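A rough sketch of the proposed behavior, under the same toy assumptions (single offer, fixed DRF order, every framework declines). The name `simulate_full_pass` and the cursor mechanism are hypothetical, not an existing Mesos design; the sketch only shows that resuming from where the previous offer left off reaches every framework. It keeps the static wait and does not model the smarter wait time.

```python
def simulate_full_pass(num_frameworks, refuse_seconds=5, interval=1, ticks=64):
    """Count offers per framework when the master finishes each pass.

    Toy model (assumption): instead of restarting at the head of the
    list, the master remembers a cursor and resumes after the framework
    that last received the offer, skipping only frameworks whose decline
    timeout is still active.
    """
    filter_expiry = [0] * num_frameworks  # time at which each decline timeout lapses
    offers_seen = [0] * num_frameworks
    cursor = 0  # position in the DRF-ordered list where the next offer starts

    for step in range(ticks):
        t = step * interval
        # Consider each framework at most once, starting at the cursor.
        for k in range(num_frameworks):
            fw = (cursor + k) % num_frameworks
            if filter_expiry[fw] <= t:
                offers_seen[fw] += 1
                filter_expiry[fw] = t + refuse_seconds  # framework declines
                cursor = (fw + 1) % num_frameworks
                break

    return offers_seen


if __name__ == "__main__":
    # All 8 frameworks now receive the offer equally often.
    print(simulate_full_pass(8))
```

The cost is visible in the model too: with 20 frameworks the last one waits 19 intervals for its first offer, which is the startup latency a shorter or adaptive wait between offers would need to offset.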
Issue Links
- duplicates
  - MESOS-3202 Avoid role/framework offer starvation in DRF allocator. (Resolved)
- is related to
  - SPARK-19703 Add Suppress/Revive support to the Mesos Spark Driver (Resolved)
- relates to
  - MESOS-3202 Avoid role/framework offer starvation in DRF allocator. (Resolved)
  - SPARK-20483 Mesos Coarse mode may starve other Mesos frameworks if max cores is not a multiple of executor cores (Resolved)
  - MESOS-6111 Offer cycle is undocumented (Open)