Details
- Type: Epic
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Labels: None
- Epic Name: Oversubscription
Description
This proposal is predicated upon offer revocation.
The idea would be to add a new "revoked" status either by (1) piggybacking off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a new status update TASK_REVOKED.
To augment an offer with metadata about revocability, there are two options:
1) Add a revocable boolean to the Offer and
a) offer only one type of Offer per slave at a particular time
b) offer both revocable and non-revocable resources at the same time but require frameworks to understand that Offers can contain overlapping resources
2) Add a revocable_resources field on the Offer which is a superset of the regular resources field. If a launchTask consumes more than resources but no more than revocable_resources, the Task becomes a revocable task; if it consumes at most resources, the Task is non-revocable.
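The semantics of option 2 can be sketched as follows. This is an illustrative model only, not Mesos code: `classify_task` and the plain dict "offer" are hypothetical names, and real Mesos offers carry typed Resource messages rather than scalar maps.

```python
def classify_task(requested, resources, revocable_resources):
    """Classify a task launch against an offer whose revocable_resources
    field is a superset of its regular resources field."""
    # A request may never exceed the revocable superset.
    for name, amount in requested.items():
        if amount > revocable_resources.get(name, 0):
            raise ValueError(f"request for '{name}' exceeds the offer")
    # Fitting entirely within the regular resources keeps the guarantees.
    if all(amount <= resources.get(name, 0)
           for name, amount in requested.items()):
        return "non-revocable"
    # Otherwise the task dips into revocable slack and may be revoked.
    return "revocable"

offer_resources = {"cpus": 4, "mem": 8192}
offer_revocable = {"cpus": 6, "mem": 10240}  # superset of the above

print(classify_task({"cpus": 2, "mem": 4096},
                    offer_resources, offer_revocable))  # non-revocable
print(classify_task({"cpus": 5, "mem": 4096},
                    offer_resources, offer_revocable))  # revocable
```

Note that under this option a single offer can describe both kinds of capacity without the overlapping-offer ambiguity of option 1b.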
The use cases for revocable tasks are batch workloads (e.g. Hadoop/Pig/MapReduce), while non-revocable tasks are online, higher-SLA workloads (e.g. services).
Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of disk. One of these resources is a rate (4 cpu-seconds per second) and two of them are fixed values (8 GB and 20 GB respectively, though disk resources can be further broken down into spindles, which are fixed, and iops, which are a rate). In practice, these are the maximum resources in the respective dimensions that this task will use. In reality, we provision tasks at some factor below peak, and only hit peak resource consumption in rare circumstances or perhaps at a diurnal peak.
In the meantime, we stand to gain from offering some constant factor of the difference between (reserved - actual) of non-revocable tasks as revocable resources, depending upon our tolerance for revocable task churn. The main challenge is coming up with an accurate short / medium / long-term prediction of resource consumption based upon current behavior.
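The slack estimate described above can be sketched in a few lines. Everything here is illustrative: `oversubscribable` and `safety_factor` are hypothetical names, and the constant factor would in practice come from the churn-tolerance / prediction machinery the paragraph mentions.

```python
def oversubscribable(reserved, actual, safety_factor=0.8):
    """Estimate revocable resources as a constant factor of the
    (reserved - actual) slack of non-revocable tasks."""
    slack = {}
    for name, res in reserved.items():
        used = actual.get(name, 0.0)
        # Clamp at zero so a task bursting above its reservation
        # never produces a negative estimate.
        slack[name] = max(0.0, res - used) * safety_factor
    return slack

# A task reserved 4 cpus / 8 GB but currently using 1.5 cpus / 3 GB.
print(oversubscribable({"cpus": 4.0, "mem": 8192.0},
                       {"cpus": 1.5, "mem": 3072.0}))
# → {'cpus': 2.0, 'mem': 4096.0}
```

Raising `safety_factor` increases utilization but also the odds of revoking tasks when the non-revocable workload ramps back toward its reservation.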
In many cases it would be OK to be sloppy:
- CPU / iops / network IO are rates (compressible) and can often tolerate running below guarantees for brief periods of time while task revocation takes place
- Memory slack can be provided by enabling swap and dynamically setting swap paging boundaries. Should swap ever be activated, that would be a signal to revoke.
The master / allocator would piggyback on the slave heartbeat mechanism to learn of the amount of revocable resources available at any point in time.
Attachments
Issue Links
- is related to: SPARK-10293 Add support for oversubscription in Mesos (Resolved)
Issues in epic
| Key | Summary | Status | Assignee |
| --- | --- | --- | --- |
| MESOS-2035 | Add reason to containerizer proto Termination | Resolved | Jie Yu |
| MESOS-2653 | Slave should act on correction events from QoS controller | Resolved | Niklas Quarfot Nielsen |
| MESOS-2704 | Add tests for QoS controller corrections | Resolved | Niklas Quarfot Nielsen |
| MESOS-2817 | Support revocable/non-revocable CPU updates in Mesos containerizer | Resolved | Ian Downes |
| MESOS-2688 | Slave should kill usage slack revocable tasks if oversubscription is disabled | Accepted | Jie Yu |
| MESOS-2651 | Implement QoS controller | Resolved | Niklas Quarfot Nielsen |
| MESOS-2650 | Modularize the Resource Estimator | Resolved | Bartek Plotka |
| MESOS-2652 | Update Mesos containerizer to understand revocable cpu resources | Resolved | Ian Downes |
| MESOS-2693 | Printing a resource should show information about reservation, disk etc | Resolved | Brian Wickman |
| MESOS-2733 | Update master to handle oversubscribed resource estimate from the slave | Resolved | Vinod Kone |
| MESOS-2691 | Update Resource message to include revocable resources | Resolved | Vinod Kone |
| MESOS-2649 | Implement Resource Estimator | Resolved | Jie Yu |
| MESOS-2689 | Slave should forward oversubscribable resources to the master | Resolved | Jie Yu |
| MESOS-2695 | Add master flag to enable/disable oversubscription | Resolved | Unassigned |
| MESOS-2646 | Update Master to send revocable resources in separate offers | Reviewable | Yongqiao Wang |
| MESOS-2645 | Design doc for resource oversubscription | Resolved | Niklas Quarfot Nielsen |
| MESOS-2647 | Slave should validate tasks using oversubscribed resources | Reviewable | Guangya Liu |
| MESOS-2648 | Update Resource Monitor to return resource usage | Resolved | Niklas Quarfot Nielsen |
| MESOS-2654 | Update FrameworkInfo to opt in to revocable resources | Resolved | Vinod Kone |
| MESOS-2655 | Implement a stand alone test framework that uses revocable cpu resources | Resolved | Benjamin Mahler |
| MESOS-2687 | Add a slave flag to enable oversubscription | Resolved | Unassigned |
| MESOS-2700 | Determine CFS behavior with biased cpu.shares subtrees | Resolved | Unassigned |
| MESOS-2701 | Implement bi-level cpu.shares subtrees in cgroups/cpu isolator. | Resolved | Unassigned |
| MESOS-2702 | Compare split/flattened cgroup hierarchy for CPU oversubscription | Resolved | Unassigned |
| MESOS-2703 | Modularize the QoS Controller | Resolved | Niklas Quarfot Nielsen |
| MESOS-2729 | Update DRF sorter to update total resources | Resolved | Vinod Kone |
| MESOS-2730 | Add a new API call to the allocator to update oversubscribed resources | Resolved | Vinod Kone |
| MESOS-2734 | Update allocator to allocate revocable resources | Resolved | Vinod Kone |
| MESOS-2735 | Change the interaction between the slave and the resource estimator from polling to pushing | Resolved | Jie Yu |
| MESOS-2741 | Exposing Resources along with ResourceStatistics from resource monitor | Resolved | haosdent |
| MESOS-2760 | Add correction message to inform slave about QoS Controller actions | Resolved | Bartek Plotka |
| MESOS-2764 | Allow Resource Estimator to get Resource Usage information. | Resolved | Bartek Plotka |
| MESOS-2770 | Slave should forward total amount of oversubscribed resources to the master | Resolved | Vinod Kone |
| MESOS-2772 | Define protobuf for ResourceMonitor::Usage. | Resolved | Bartek Plotka |
| MESOS-2773 | Pass callback to the resource estimator to retrieve ResourceUsage from Resource Monitor on demand. | Resolved | Bartek Plotka |
| MESOS-2775 | Slave should expose metrics about oversubscribed resources | Resolved | Benjamin Mahler |
| MESOS-2804 | Log framework capabilities in the master. | Resolved | Benjamin Mahler |
| MESOS-2808 | Slave should call into resource estimator whenever it wants to forward oversubscribed resources | Resolved | Vinod Kone |
| MESOS-2823 | Pass callback to the QoS Controller to retrieve ResourceUsage from Resource Monitor on demand. | Resolved | Bartek Plotka |
| MESOS-2753 | Master should validate tasks using oversubscribed resources | Resolved | Vinod Kone |
| MESOS-2791 | Create a FixedResourceEstimator to return fixed amount of oversubscribable resources. | Resolved | Jie Yu |
| MESOS-2818 | Pass 'allocated' resources for each executor to the resource estimator. | Resolved | Jie Yu |
| MESOS-2776 | Master should expose metrics about oversubscribed resources | Resolved | Yan Xu |
| MESOS-2845 | Command tasks lead to a mixing of revocable / non-revocable cpus and memory within the container. | Accepted | Unassigned |
| MESOS-2866 | Slave should send oversubscribed resource information after master failover. | Resolved | Benjamin Mahler |
| MESOS-2869 | OversubscriptionTest.FixedResourceEstimator is flaky | Resolved | Jie Yu |
| MESOS-2919 | Framework can overcommit oversubscribable resources during master failover. | Resolved | Jie Yu |
| MESOS-2930 | Allow the Resource Estimator to express over-allocation of revocable resources. | Accepted | Unassigned |
| MESOS-2933 | Pass slave's total resources to the ResourceEstimator and QoSController via Slave::usage(). | Resolved | Bartek Plotka |
| MESOS-3563 | Revocable task CPU shows as zero in /state.json | Resolved | Vinod Kone |
| MESOS-4076 | Create simple LoadQoSController which will evict revocable executors when system load is too high. | Resolved | Bartek Plotka |
| MESOS-4442 | `allocated` may have more resources then `total` in allocator | Resolved | Klaus Ma |