  1. Mesos
  2. MESOS-4302

Offer filter timeouts are ignored if the allocator is slow or backlogged.

    Details

    • Target Version/s:
    • Sprint:
      Mesosphere Sprint 26, Mesosphere Sprint 27

      Description

      Currently, when the allocator recovers resources from an offer, it creates a filter timeout based on the time at which the call is processed.

      This means that if it takes longer than the filter duration for the allocator to perform an allocation for the relevant agent, then the filter is never applied.

      This leads to pathological behavior: if the framework sets a filter duration that is smaller than the wall clock time it takes for us to perform the next allocation, then the filters will have no effect. This can mean that low share frameworks may continue receiving offers that they have no intent to use, without other frameworks ever receiving these offers.

      The workaround is for frameworks to set high filter durations and, if necessary, revive offers when they need more resources. However, we should fix this issue in the allocator (i.e. derive the timeout deadlines and expiry based on allocation times).
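
      A minimal sketch of the failure mode described above, assuming a simplified model in which the filter is created when the decline is processed and simply expires after the requested duration. The function name and the use of std::chrono are illustrative assumptions, not Mesos code.

```cpp
#include <cassert>
#include <chrono>

using Seconds = std::chrono::seconds;

// Hypothetical model (not Mesos code): the filter is created at
// `processedAt`, when the allocator processes the decline, and expires
// `timeout` later. The next allocation for the agent happens at
// `nextAllocationAt`. All times are offsets from an arbitrary origin.
inline bool filterAppliesAtNextAllocation(
    Seconds processedAt, Seconds timeout, Seconds nextAllocationAt)
{
  assert(timeout >= Seconds::zero());
  assert(nextAllocationAt >= processedAt);

  // The filter only has an effect if it is still alive when the
  // allocator next considers this agent.
  return processedAt + timeout > nextAllocationAt;
}
```

      In this model a 5s filter has no effect if the next allocation for the agent happens 8s after the decline is processed, while the same filter does apply if that allocation happens 1s later.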

      This seems to warrant cherry-picking into bug fix releases.

          Activity

          gyliu Guangya Liu added a comment -

          I have a draft idea for this, as follows (https://reviews.apache.org/r/42028/ has some problems with its unit tests): if the filter duration for recovered resources is less than the allocation interval, set the filter duration to the allocation interval and log an INFO-level message telling the end user what the allocator is doing. Joseph Wu, Alexander Rukletsov, what do you think? Thanks.

           if (seconds.get() != Duration::zero()) {
              Duration filterTimeOut = seconds.get();

              // Make sure the filter outlives at least one allocation cycle.
              if (filterTimeOut < allocationInterval) {
                filterTimeOut = allocationInterval;
                LOG(INFO) << "Framework " << frameworkId
                          << " filtered slave " << slaveId
                          << " for " << seconds.get()
                          << " which is less than allocationInterval "
                          << allocationInterval
                          << ", using allocationInterval "
                          << allocationInterval
                          << " instead to make sure the recovered resources can"
                          << " be aggregated for at least one allocation cycle.";
              } else {
                VLOG(1) << "Framework " << frameworkId
                        << " filtered slave " << slaveId
                        << " for " << filterTimeOut;
              }
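
          The clamping rule in the draft above can be sketched as a standalone function. This is a hedged illustration only: effectiveFilterTimeout and the Millis alias are hypothetical names, and std::chrono stands in for the Duration type used in the draft.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical helper (not part of Mesos): returns the filter duration
// that would actually be used under the draft's rule.
inline Millis effectiveFilterTimeout(Millis requested, Millis allocationInterval)
{
  assert(requested >= Millis::zero());
  assert(allocationInterval > Millis::zero());

  // A zero duration means "no filter" in the draft, so leave it alone.
  if (requested == Millis::zero()) {
    return requested;
  }

  // Otherwise never let the filter be shorter than one allocation cycle.
  return std::max(requested, allocationInterval);
}
```

          Under this rule, and assuming a 1s allocation interval, a request for 100 milliseconds becomes 1s, a request for 5s stays at 5s, and a request for 0 still means no filtering.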
          
          klaus1982 Klaus Ma added a comment -

          I'd suggest keeping the current behaviour; it depends on how the framework evaluates the duration and how it uses it.

          Here's the use case from testing Swarm on Mesos: currently, Swarm uses the whole slave to launch tasks and the unused resources are returned to Mesos; the default filter duration is 5s (this pull request makes it configurable: https://github.com/docker/swarm/pull/1585). When running Swarm and K8S on Mesos, I set it to 0.1 to ask Mesos to re-shuffle the resources in the next allocation cycle.

          Benjamin Mahler/Alexander Rukletsov/Joseph Wu, any comments on this case?

          gyliu Guangya Liu added a comment -

          In my understanding, the filter should work together with reviveOffers: the filter can aggregate small resources into a big one, and reviveOffers can then make those resources usable. I think it is a hack to set the filter smaller than the allocation interval.

          bmahler Benjamin Mahler added a comment -

          Hm.. I didn't understand the use case or what setting it specifically to 100 milliseconds will accomplish. Is it that you don't want filtering at all? (then just set it to 0 seconds rather than 100 milliseconds)

          klaus1982 Klaus Ma added a comment -

          Regarding reviveOffers, do you mean I have to call reviveOffers just after launching tasks? But how do I distinguish a special filter from a re-shuffle filter?

          This case is about re-shuffling resources between "nice/friendly" frameworks. offerRescinded is a better solution for that, but it seems only Quota & Maintenance trigger it for now. I logged a JIRA (MESOS-4303) about re-shuffling resources between frameworks. I think it's too early to do this ticket before frameworks support re-shuffling.

          klaus1982 Klaus Ma added a comment -

          Yes, in this case, I just want Mesos to re-shuffle resources between Swarm & K8S.

          alexr Alexander Rukletsov added a comment -

          Let me elaborate a bit on the issue and possible workarounds.

          First off, the described situation, in which the filter is technically never applied, may happen even when the allocator is not slow or backlogged, for example if the timeout is set to 10s and the allocation interval is 100s. Moreover, a third-party allocator can perform allocations in an arbitrary manner.

          However, there is a real problem: idle low share frameworks may block resources in "offer-decline" cycles. Joris Van Remoortere nicely summarized the issue in one sentence: "It's a shame if the 'default' (5s filter) doesn't co-operate well as your cluster scales". We have to fix it.

          I would argue that the "right" solution to this problem is a combination of quota and suppressing offers. But quota is neither mandatory nor available before 0.27.0 (while the fix can be easily backported). Currently we are leaning towards providing a patch with a small footprint that fixes the transactionality of the offer timeout, and cherry-picking it into Mesos versions prior to 0.27.0.

          tnachen Timothy Chen added a comment -

          Seems like we're still discussing what the right fix is (or even whether we want to fix anything), and I doubt we can resolve this before the end of this week.
          Can we remove the target version for now?

          alexr Alexander Rukletsov added a comment -

          We still aim for 0.27. I would say that's better than releasing 0.27.1 right after.

          klaus1982 Klaus Ma added a comment -

          Regarding quota, if a framework cannot consume its quota right now, the resources are not used by others. Maybe this can be handled by Optimistic Offer.
          For the solution, when will the filter time out? Taking your example, where the timeout is set to 10s and the allocation interval is 100s:
          1. if there's 5s until the next allocation, will the filter time out in 10s?
          2. if there's 15s+ until the next allocation, will the filter time out in ~100s + 10s?
          3. if there's 10s until the next allocation, will the filter time out in 100s?

          tnachen Timothy Chen added a comment -

          Ok, the fix has to be there soon as we're starting to put together a release branch. If it doesn't make it over the weekend I'll update the fix version.

          alexr Alexander Rukletsov added a comment - edited

          https://reviews.apache.org/r/42355/
          https://reviews.apache.org/r/42629/

          bmahler Benjamin Mahler added a comment -
          commit 447d814ac80e67f30a0ffe2ee6047d85dc8fc383
          Author: Alexander Rukletsov <rukletsov@gmail.com>
          Date:   Thu Jan 21 23:17:22 2016 -0800
          
              Removed the timeout from the offer filter in the allocator.
          
              Without the timeout, we rely on filter expiration only. This guarantees
              that filter removal is scheduled after `allocate()` if the allocator is
              backlogged given default parameters are used. Additionally we ensure the
              filter timeout is at least as big as the allocation interval.
          
              Review: https://reviews.apache.org/r/42355/
          

          Tests:

          commit ecfb8d53da58cc694ef885c929873042618dc16e
          Author: Alexander Rukletsov <rukletsov@gmail.com>
          Date:   Thu Jan 21 23:29:06 2016 -0800
          
              Added tests for offer filters.
          
              Review: https://reviews.apache.org/r/42629/
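
          The ordering guarantee mentioned in the first commit message can be illustrated with a toy model. It assumes that periodic allocations are enqueued at multiples of the allocation interval, that the filter's expiry is enqueued once the (possibly clamped) timeout has elapsed after the decline is processed, and that a single FIFO queue handles events in enqueue order however late it runs. All names below are hypothetical and this is not the allocator's actual code.

```cpp
#include <algorithm>
#include <cassert>

// Toy model (an assumption, not Mesos code). Times are in milliseconds.

// First periodic allocation tick strictly after time `t`.
inline long nextTickAfter(long t, long intervalMs)
{
  return (t / intervalMs + 1) * intervalMs;
}

// Returns true if the next allocation is handled before the filter's
// expiry in a FIFO queue, i.e. if the filter can still take effect.
inline bool allocateHandledBeforeExpire(
    long declineAtMs, long timeoutMs, long intervalMs, bool clampToInterval)
{
  assert(declineAtMs >= 0 && timeoutMs > 0 && intervalMs > 0);

  // With clamping, the timeout is at least one allocation interval.
  const long effectiveMs =
      clampToInterval ? std::max(timeoutMs, intervalMs) : timeoutMs;

  const long allocateEnqueuedAt = nextTickAfter(declineAtMs, intervalMs);
  const long expireEnqueuedAt = declineAtMs + effectiveMs;

  // FIFO: the earlier-enqueued event is handled first, even if the
  // queue is backlogged. Ties are assumed to favour the allocation.
  return allocateEnqueuedAt <= expireEnqueuedAt;
}
```

          In this model, with a 1s interval and clamping enabled, the allocation is always handled before the expiry, including for a 100 millisecond request. Without clamping, the same 100 millisecond filter expires before the allocation is ever handled.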
          

            People

            • Assignee: alexr Alexander Rukletsov
            • Reporter: bmahler Benjamin Mahler
            • Shepherd: Benjamin Mahler
            • Votes: 1
            • Watchers: 9
