Details

    • Epic Name:
      Oversubscription

      Description

      This proposal is predicated upon offer revocation.

      The idea would be to add a new "revoked" status either by (1) piggybacking off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a new status update TASK_REVOKED.
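      As a rough illustration of option (2), a dedicated revocation status might look like the following C++ sketch. The enum value and helper are hypothetical, not actual Mesos code:

        #include <iostream>
        #include <string>

        enum class TaskState {
          TASK_RUNNING,
          TASK_FINISHED,
          TASK_KILLED,   // option (1): piggyback revocation on this...
          TASK_LOST,     // ...or on this existing status update
          TASK_REVOKED,  // option (2): explicit revocation status
        };

        // With an explicit status, a framework can distinguish revocation
        // (resources reclaimed by the master) from genuine failure, and
        // e.g. resubmit the work elsewhere.
        std::string describe(TaskState s) {
          switch (s) {
            case TaskState::TASK_REVOKED:
              return "preempted to reclaim oversubscribed resources";
            case TaskState::TASK_LOST:
              return "lost (slave failure, or revocation under option 1)";
            default:
              return "other";
          }
        }

        int main() {
          std::cout << describe(TaskState::TASK_REVOKED) << std::endl;
        }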

      To augment an offer with metadata about revocability, there are two options:
      1) Add a revocable boolean to the Offer, and either
         a) offer only one type of Offer per slave at a particular time, or
         b) offer both revocable and non-revocable resources at the same time, but require frameworks to understand that Offers can contain overlapping resources.
      2) Add a revocable_resources field on the Offer that is a superset of the regular resources field. A task launched with more than resources but no more than revocable_resources becomes a revocable task; a task launched within resources is non-revocable (see the sketch below).
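      A minimal sketch of option (2)'s launch semantics, with illustrative types standing in for the real protobufs:

        // Hypothetical types; the real Offer/Resources messages differ.
        struct Resources {
          double cpus;
          double memMB;
          bool operator<=(const Resources& other) const {
            return cpus <= other.cpus && memMB <= other.memMB;
          }
        };

        struct Offer {
          Resources resources;            // guaranteed, non-revocable
          Resources revocable_resources;  // superset: guaranteed + revocable slack
        };

        // A launch is revocable iff it needs more than the guaranteed
        // resources but still fits within the revocable superset.
        bool isRevocableLaunch(const Offer& offer, const Resources& requested) {
          return !(requested <= offer.resources) &&
                 (requested <= offer.revocable_resources);
        }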

      The use case for revocable tasks is batch workloads (e.g. hadoop/pig/mapreduce); non-revocable tasks are for online, higher-SLA workloads (e.g. services).

      Consider a non-revocable task that asks for 4 cores, 8 GB RAM, and 20 GB of disk. One of these resources is a rate (4 CPU-seconds per second) and two are fixed values (8 GB and 20 GB respectively, though disk resources can be further broken down into spindles, which are fixed, and iops, a rate). In practice, these are the maximum resources in the respective dimensions that this task will use. In reality, we provision tasks at some factor below peak, and only hit peak resource consumption in rare circumstances or perhaps at a diurnal peak.

      In the meantime, we stand to gain from offering some constant factor of the difference between reserved and actual usage of non-revocable tasks as revocable resources, depending upon our tolerance for revocable task churn. The main challenge is coming up with an accurate short- / medium- / long-term prediction of resource consumption based upon current behavior.
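      A back-of-the-envelope sketch of that calculation (the constant factor and the usage forecast are the hard parts; this only shows the arithmetic, with hypothetical names):

        #include <algorithm>

        double revocableEstimate(double reserved,
                                 double predictedUsage,
                                 double safetyFactor) {  // e.g. 0.8
          // Offer a constant factor of the (reserved - actual) slack.
          return safetyFactor * std::max(0.0, reserved - predictedUsage);
        }

        // Example: a task reserved 4 cpus but is predicted to use 2.5;
        // with safetyFactor = 0.8 we would offer 1.2 revocable cpus.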

      In many cases it would be OK to be sloppy:

      • CPU / iops / network IO are rates (compressible) and can often tolerate running below guarantees for brief periods while task revocation takes place.
      • Memory slack can be provided by enabling swap and dynamically setting swap paging boundaries. Should swap ever be activated, that would be a signal to revoke (see the sketch below).
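      An illustrative, Linux-specific sketch of the swap-as-signal idea (not Mesos code): poll /proc/vmstat and treat any growth in pages swapped out as a hint to start revoking.

        #include <fstream>
        #include <string>

        // Cumulative count of pages swapped out since boot.
        long pagesSwappedOut() {
          std::ifstream vmstat("/proc/vmstat");
          std::string key;
          long value;
          while (vmstat >> key >> value) {
            if (key == "pswpout") return value;
          }
          return 0;
        }

        // Any increase between polls means swap was activated.
        bool shouldRevoke(long previous, long current) {
          return current > previous;
        }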

      The master / allocator would piggyback on the slave heartbeat mechanism to learn of the amount of revocable resources available at any point in time.

      1. mesos_virtual_offers.pdf (813 kB, Benjamin Hindman)

            Activity

            Niklas Quarfot Nielsen added a comment -

            Just throwing in a SGTM

            Vinod Kone added a comment -

            Sgtm.

            Also, I realized 'revocable_available' should really be 'revocable' because it represents both allocated and available.

            Considering we might have a different pool of revocable resources from optimistic offers, should we just rename 'revocable' to 'oversubscribed'?

            Joris Van Remoortere added a comment -

            Vinod Kone I think the allocator logic generally makes sense. I would just call out that we will likely want to treat revocable_available differently for resources coming from the resource estimator as opposed to optimistic offers. The reason for that is:
            1) resource estimator updates are a "rude edit", as in they purely overwrite the revocable resources
            2) resources from optimistic offers are increased / decreased based on allocation by the original owner of the resources.

            The same way that we expect the Revocable resources to be flagged differently in the offer protobuf, I think we may want to either:
            1) have separate pools of revocable resources available in the allocator for each source (lender?) of the resource OR
            2) ensure that all revocable resources are introduced into the allocator the same way (as in rude edits, or deltas).

            In general, though, I think the behavior is common between them.

            What do you think?
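            A small sketch of the two reconciliation styles described above, assuming a single scalar resource and illustrative names:

              #include <map>
              #include <string>

              struct RevocablePools {
                // Option (1): separate pools keyed by source of the resources.
                std::map<std::string, double> pools;

                // Resource estimator updates are "rude edits": the new
                // estimate simply overwrites the previous value.
                void rudeEdit(const std::string& source, double estimate) {
                  pools[source] = estimate;
                }

                // Optimistic-offer updates arrive as deltas, tracking
                // allocation/recovery by the resources' original owner.
                void applyDelta(const std::string& source, double delta) {
                  pools[source] += delta;
                }

                double total() const {
                  double sum = 0;
                  for (const auto& entry : pools) sum += entry.second;
                  return sum;
                }
              };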

            Vinod Kone added a comment -

            This is the high level idea of how the different components (described in the design doc) interact for oversubscription for the MVP.

            --> Resource estimator sends an estimate of 'oversubscribable' resources to the slave.

            --> Slave periodically checks if its cached value of 'revocable resources' (i.e., allocations of revocable containers + oversubscribable resources) has changed. If changed, slave forwards 'revocable resources' to the master.

            --> Master rescinds outstanding revocable offers when it gets new 'revocable resources' estimate and updates the allocator.

            --> On receiving 'revocable resources' update, allocator updates 'revocable_available' (revocable resources - revocable allocation) resources.

            --> 'revocable_available' gets allocated to (and recovered from) frameworks in the same way as 'available' (regular resources).

            --> When sending offers master sends separate offers for revocable and regular resources.

            Some salient features of this proposal:
            --> Allocator changes are minimal.
            --> Slave forwards estimates only when there is a change => low load on master.
            --> Split offers allow the master to rescind only revocable resources when necessary.

            Thoughts?
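            A condensed sketch of the slave-side step in this flow: recompute 'revocable resources' on each poll and forward to the master only on change. All names are illustrative, not actual Mesos code:

              #include <functional>

              struct Resources {
                double cpus;
                double memMB;
                bool operator==(const Resources& other) const {
                  return cpus == other.cpus && memMB == other.memMB;
                }
              };

              class Slave {
               public:
                // Called periodically with the estimator's latest output and
                // the current allocation of revocable containers.
                void checkRevocable(
                    const Resources& oversubscribable,
                    const Resources& revocableAllocation,
                    const std::function<void(const Resources&)>& forwardToMaster) {
                  Resources revocable{
                      oversubscribable.cpus + revocableAllocation.cpus,
                      oversubscribable.memMB + revocableAllocation.memMB};
                  if (!(revocable == cached_)) {  // forward only on change
                    cached_ = revocable;
                    forwardToMaster(revocable);   // => low load on the master
                  }
                }

               private:
                Resources cached_{0, 0};
              };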

            Niklas Quarfot Nielsen added a comment -

            @vinodkone @jieyu - do you want to tag the oversubscription tickets (or a subset) for 0.23.0, so we make sure the necessary pieces land in the release?

            Niklas Quarfot Nielsen added a comment -

            Folks,

            here is the architecture document we have been working on for introducing oversubscription in Mesos: https://docs.google.com/document/d/1pUnElxHy1uWfHY_FOvvRC73QaOGgdXE0OXN-gbxdXA0/edit#

            It is still a work in progress, so feel free to add suggestions and raise concerns.

            Niklas Quarfot Nielsen added a comment -

            Oversubscription means many things and can be considered a subset of the ongoing optimistic offers effort.
            Optimistic offers let the allocator offer resources:

            • To multiple frameworks, to increase 'parallelism' (as opposed to the conservative/pessimistic scheme) and *increase task throughput*.
            • As preemptable resources drawn from reserved but unallocated resources, to *limit reservation slack* (the difference between reserved and allocated resources).

            A third (and equally important) case, which expands these scenarios, is oversubscription of allocated resources, which limits *usage slack* (the difference between allocated and used resources).
            There has been a lot of recent research showing that usage slack can be reduced by 60% while maintaining the Service Level Objective (SLO) of latency-critical workloads (1). However, this kind of oversubscription needs policies and fine-tuning to make sure that best-effort tasks don't interfere with latency-critical ones. Therefore, we'd like to start a discussion on how such a system would look in Mesos. I will create a JIRA ticket (linking to this one) to start the conversation.

            (1) http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43017.pdf

            Benjamin Mahler added a comment -

            If we introduce additional offers for a slave, marked revocable via a boolean flag, existing schedulers will not check this flag and will schedule on the offer on the assumption that the offer semantics haven't changed. Perhaps not a huge issue given there are not a lot of production schedulers out there, but certainly something to keep in mind.

            I think I like the explicitness of revocable offers being separate, since they are to be considered more volatile.
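            A toy illustration of that compatibility concern: with a boolean flag, a legacy scheduler accepts everything, while a revocation-aware scheduler must filter explicitly (hypothetical types):

              #include <vector>

              struct Offer {
                int id;
                bool revocable;  // a legacy scheduler never reads this
              };

              // A flag-aware scheduler opts out of volatile resources
              // deliberately; a legacy one would schedule on all offers.
              std::vector<Offer> nonRevocableOnly(const std::vector<Offer>& offers) {
                std::vector<Offer> result;
                for (const Offer& offer : offers) {
                  if (!offer.revocable) result.push_back(offer);
                }
                return result;
              }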

            brian wickman added a comment -

            The upside of having a revocable_resources field is that it is backwards compatible with existing frameworks. Overall I feel it would be easier to implement than having two classes of Offers.

            I'm not suggesting that we can't revoke tasks that have been allocated from the non-revocable resources. I guess in that sense maybe it should be renamed to batch_resources or lowsla_resources instead.

            Benjamin Hindman added a comment -

            Thanks for writing this out Brian.

            I'd prefer to keep things as simple as possible, i.e., a boolean flag per offer indicating whether or not the resources within the offer are revocable. This also captures the design we (Berkeley) had discussed in the past.

            It is true that the current semantics are to only have one offer per slave per offers callback, but I don't think we need to adhere to this going forward. That is, I think it's perfectly reasonable to have two offers for the same slave in the list of offers, where one is for revocable resources and one is for non-revocable resources. A long-standing desire is to enable schedulers to aggregate offers on the same slave. One could imagine the aggregate only being tainted/revocable if one of the offers contains revocable resources. Enabling aggregate offers might be a great starter project, actually (I think there is a JIRA out there for this).
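            A sketch of that aggregation idea, with the aggregate tainted as revocable if any constituent offer is (illustrative types only):

              #include <map>
              #include <string>
              #include <vector>

              struct Offer {
                std::string slaveId;
                double cpus;
                bool revocable;
              };

              // Merge all offers for the same slave into one aggregate.
              std::map<std::string, Offer> aggregateBySlave(
                  const std::vector<Offer>& offers) {
                std::map<std::string, Offer> bySlave;
                for (const Offer& offer : offers) {
                  auto result = bySlave.emplace(offer.slaveId, offer);
                  if (!result.second) {
                    Offer& aggregate = result.first->second;
                    aggregate.cpus += offer.cpus;
                    // Taint: revocable if any part is revocable.
                    aggregate.revocable = aggregate.revocable || offer.revocable;
                  }
                }
                return bySlave;
              }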

            Also, I'm not keen on handcuffing the allocator with the semantics that non-revocable resources will never be revoked. That is to say, I can imagine sophisticated allocators "reallocating" resources amongst frameworks in order to "defrag" the cluster for better utilization, to turn off machines, or enable running more tasks. We've always played with the idea of masking these revocations as machine failures (i.e., TASK_LOST), assuming that more resources will be allocated to the framework ASAP. But we might be able to capture this more explicitly. For example, one could imagine a "reallocated" callback that offers resources to replace what was revoked. I'm all ears if you have ideas for better capturing these semantics via the API.

            Finally (and related to the above), in conjunction with revocation (not so much oversubscription) I'd like to introduce "inverse offers": a request for the scheduler to kill its own tasks in order to free up resources in the cluster. Like other things in Mesos, this enables the scheduler to be involved in the process if it wants to be (if it doesn't, the system will just decide what to revoke). I'll attach a poster I had previously created with a lot of these ideas.
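            The shape of an inverse offer might be as simple as the following sketch (all names hypothetical):

              #include <string>

              struct Resources {
                double cpus;
                double memMB;
              };

              // The master asks a scheduler to free resources by a deadline;
              // a cooperating scheduler kills its own lowest-value tasks on
              // that slave, otherwise the system decides what to revoke.
              struct InverseOffer {
                std::string slaveId;
                Resources requested;    // resources the master wants back
                double deadlineSecs;    // grace period before forced revocation
              };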

            Note that it's not clear we need/should design all these bits in order to get to oversubscription. Just adding the revocable boolean might be sufficient for now. I just want to make sure that we don't walk ourselves into a corner where some of these other features/mechanisms will become very tedious to introduce.


              People

              • Assignee: Unassigned
              • Reporter: brian wickman