[FLINK-14029] Update Flink's Mesos scheduling behavior to reject all expired offers - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.10.0
Component/s: None
Labels:
- pull-request-available

Release Note:
Flink's Mesos integration now rejects all expired offers instead of at most 4 at a time. This helps in situations where Fenzo holds on to a large number of expired offers without giving them back to the Mesos resource manager.

Description

While digging into why our Flink jobs weren't being scheduled on our internal Mesos setup, we noticed that we were hitting Mesos quota limits because of the way the Fenzo (https://github.com/Netflix/Fenzo/) library defaults are set up in the Flink project.

The behavior we noticed was that we got a large number of offers from our Mesos master (50+), of which 1 or 2 were heavily skewed and took up a huge chunk of our disk resource quota. As a result we were not sent any new or different offers, because our usage at the time plus the outstanding resource offers reached our Mesos disk quota. Since the Flink / Fenzo Mesos scheduling code was not using the 1-2 skewed disk offers, they ended up expiring.

The Fenzo scheduler is set up to use the default values for when to expire unused offers (120s) and for the maximum number of unused offer leases to reject at a time (4). Unfortunately, with a considerable number of outstanding expired offers (50+), we reject only 4 or so every 2 minutes and never get around to rejecting the heavily skewed disk offers that are stopping us from scheduling our Flink job. We end up with our job waiting to be scheduled for more than an hour.

One way to work around this is to reject all expired offers once they hit the 2 minute expiry time rather than holding on to them. This allows Mesos to send alternate offers that Fenzo might be able to schedule.
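The effect of the rejection cap can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Flink's or Fenzo's actual code: the class and method names are hypothetical, expired offers are assumed to be rejected in FIFO order, one rejection round is assumed to happen per 120s expiry interval, and no new offers are assumed to arrive or expire in the meantime.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the behavior described above; not Fenzo's implementation.
public class ExpiredOfferRejection {

    // Expiry interval from the issue description (Fenzo default of 120s).
    static final int EXPIRY_SECONDS = 120;

    /**
     * Returns the 1-based rejection round in which the offer at
     * targetPosition (0-based position in the queue) is rejected, when at
     * most maxRejectsPerRound expired offers are rejected per round.
     * Assumption: offers are rejected in FIFO order.
     */
    static int roundsUntilRejected(int expiredOffers, int maxRejectsPerRound, int targetPosition) {
        if (maxRejectsPerRound <= 0) {
            throw new IllegalArgumentException("maxRejectsPerRound must be positive");
        }
        if (targetPosition < 0 || targetPosition >= expiredOffers) {
            throw new IllegalArgumentException("targetPosition is outside the queue");
        }
        Deque<Integer> queue = new ArrayDeque<>();
        for (int i = 0; i < expiredOffers; i++) {
            queue.add(i);
        }
        int rounds = 0;
        while (!queue.isEmpty()) {
            rounds++;
            // Reject up to the cap in this round.
            for (int i = 0; i < maxRejectsPerRound && !queue.isEmpty(); i++) {
                if (queue.poll() == targetPosition) {
                    return rounds;
                }
            }
        }
        throw new IllegalStateException("unreachable: target was validated above");
    }

    public static void main(String[] args) {
        int offers = 50;          // "50+" outstanding expired offers in the issue
        int skewed = offers - 1;  // assume the skewed offer is last in line

        int capped = roundsUntilRejected(offers, 4, skewed);
        int rejectAll = roundsUntilRejected(offers, offers, skewed);

        System.out.println("cap of 4:   round " + capped
                + " (~" + capped * EXPIRY_SECONDS + "s)");
        System.out.println("reject all: round " + rejectAll
                + " (~" + rejectAll * EXPIRY_SECONDS + "s)");
    }
}
```

Under these assumptions, with 50 expired offers and a cap of 4 the last offer in line is only rejected in round 13, whereas rejecting all expired offers clears it in the first round. In a real cluster new offers keep arriving and expiring, so the delay with the cap can be considerably longer.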

Issue Links

links to

GitHub Pull Request #9652

Activity

Till Rohrmann added a comment - 16/Sep/19 11:58

Fixed via e6f87d33ae891dce89463868500f91c3fe01265c

People

Assignee: Piyush Narang

Reporter: Piyush Narang

Votes: 0

Watchers: 2

Dates

Created: 09/Sep/19 14:39

Updated: 16/Sep/19 11:58

Resolved: 16/Sep/19 11:58

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m