Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1
None
Description
As the number of connected frameworks grows, allocation time increases sharply. The addition of quota in 0.27 made these numbers worse. Running `mesos-tests.sh --benchmark --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us the following numbers:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 200 frameworks
round 0 allocate took 2.921202secs to make 200 offers
round 1 allocate took 2.85045secs to make 200 offers
round 2 allocate took 2.823768secs to make 200 offers
Increasing the number of frameworks to 2000:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 28.209454secs to make 2000 offers
round 1 allocate took 28.469419secs to make 2000 offers
round 2 allocate took 28.138086secs to make 2000 offers
I was able to reduce this time substantially. After applying the patches, the run with 200 frameworks gives:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 200 frameworks
round 0 allocate took 1.016226secs to make 2000 offers
round 1 allocate took 1.102729secs to make 2000 offers
round 2 allocate took 1.102624secs to make 2000 offers
And with 2000 frameworks:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 12.563203secs to make 2000 offers
round 1 allocate took 12.437517secs to make 2000 offers
round 2 allocate took 12.470708secs to make 2000 offers
The patches do three things to improve the performance of the allocator:
1) The total values in the DRFSorter are precalculated per resource type.
2) In the allocate method, when no resources are available to allocate, we break out of the innermost loop, so we do not loop over a large number of frameworks when there is nothing to allocate.
3) When a framework suppresses offers, we remove it from the sorter instead of just calling continue in the allocation loop. This greatly improves performance in the sorter and prevents looping over frameworks that don't need resources.
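The three changes can be illustrated with a toy model. The sketch below is not the Mesos implementation: `ToySorter`, `addCapacity`, `deactivate`, `activate`, and `allocate` are hypothetical names chosen for illustration, and the model tracks plain named scalar resources rather than Mesos `Resources`. It assumes that "precalculated totals" means keeping a running per-resource total instead of re-summing capacity each time a share is computed.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a DRF sorter (not the Mesos API).
class ToySorter {
public:
  // (1) Per-resource totals are updated once, when capacity changes,
  // so share() never has to re-sum capacity over all agents.
  void addCapacity(const std::string& resource, double amount) {
    assert(amount >= 0.0);
    totals_[resource] += amount;
  }

  void add(const std::string& client) {
    allocations_[client];  // Create an empty allocation entry.
    active_.insert(client);
  }

  // (3) A client that suppresses offers leaves the active set entirely,
  // so sorting and allocation never touch it.
  void deactivate(const std::string& client) { active_.erase(client); }

  void activate(const std::string& client) {
    if (allocations_.count(client) > 0) active_.insert(client);
  }

  void allocated(const std::string& client,
                 const std::string& resource,
                 double amount) {
    allocations_[client][resource] += amount;
  }

  // Dominant share: the largest fraction of any resource total held.
  double share(const std::string& client) const {
    double result = 0.0;
    auto it = allocations_.find(client);
    if (it == allocations_.end()) return result;
    for (const auto& entry : it->second) {
      auto total = totals_.find(entry.first);
      if (total != totals_.end() && total->second > 0.0) {
        result = std::max(result, entry.second / total->second);
      }
    }
    return result;
  }

  // Active clients ordered by ascending dominant share.
  std::vector<std::string> sorted() const {
    std::vector<std::string> result(active_.begin(), active_.end());
    std::stable_sort(
        result.begin(), result.end(),
        [this](const std::string& a, const std::string& b) {
          return share(a) < share(b);
        });
    return result;
  }

private:
  std::map<std::string, double> totals_;
  std::map<std::string, std::map<std::string, double>> allocations_;
  std::set<std::string> active_;
};

// Hands out `chunk` units of "cpus" per client until nothing is left.
// Returns how many clients were visited.
int allocate(ToySorter& sorter, double available, double chunk) {
  int visited = 0;
  for (const std::string& client : sorter.sorted()) {
    // (2) Stop as soon as nothing remains, instead of iterating over
    // every remaining client with an empty offer.
    if (available < chunk) break;
    ++visited;
    sorter.allocated(client, "cpus", chunk);
    available -= chunk;
  }
  return visited;
}
```

In this model, a deactivated client costs nothing during sorting or allocation, which is why the third change should matter most when the majority of frameworks suppress offers.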
Assuming that most frameworks behave nicely and suppress offers when they have nothing to schedule, it is fair to assume that point 3) has the biggest impact on performance. If we suppress offers for 90% of the frameworks in the benchmark test, we see the following numbers:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 200 slaves and 2000 frameworks
round 0 allocate took 11626us to make 200 offers
round 1 allocate took 22890us to make 200 offers
round 2 allocate took 21346us to make 200 offers
And for 200 frameworks:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 1.11178secs to make 2000 offers
round 1 allocate took 1.062649secs to make 2000 offers
round 2 allocate took 1.080181secs to make 2000 offers
Review requests:
https://reviews.apache.org/r/43665/
https://reviews.apache.org/r/43666/
https://reviews.apache.org/r/43668/
Issue Links
- is related to
  MESOS-5781 Benchmark allocation with framework suppression. (Reviewable)
- relates to
  MESOS-3157 Only perform periodic resource allocations. (Resolved)
  MESOS-5279 DRF sorter add/activate doesn't check if it's adding a duplicate entry (Resolved)