Mesos / MESOS-4694

DRFAllocator takes very long to allocate resources with a large number of frameworks


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.26.0, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1
    • Fix Version/s: 1.1.0
    • Component/s: allocation
    • Labels: None

    Description

      As the number of connected frameworks grows, allocation time becomes very long. The addition of quota in 0.27 increased it further. Running `mesos-tests.sh --benchmark --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives the following numbers:

      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
      [ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
      Using 2000 slaves and 200 frameworks
      round 0 allocate took 2.921202secs to make 200 offers
      round 1 allocate took 2.85045secs to make 200 offers
      round 2 allocate took 2.823768secs to make 200 offers
      

      Increasing the number of frameworks to 2000:

      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
      [ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
      Using 2000 slaves and 2000 frameworks
      round 0 allocate took 28.209454secs to make 2000 offers
      round 1 allocate took 28.469419secs to make 2000 offers
      round 2 allocate took 28.138086secs to make 2000 offers
      

      I was able to reduce this time substantially (from about 28 s to about 12.5 s per round with 2000 frameworks). After applying the patches:

      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
      [ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
      Using 2000 slaves and 200 frameworks
      round 0 allocate took 1.016226secs to make 2000 offers
      round 1 allocate took 1.102729secs to make 2000 offers
      round 2 allocate took 1.102624secs to make 2000 offers
      

      And with 2000 frameworks:

      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
      [ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
      Using 2000 slaves and 2000 frameworks
      round 0 allocate took 12.563203secs to make 2000 offers
      round 1 allocate took 12.437517secs to make 2000 offers
      round 2 allocate took 12.470708secs to make 2000 offers
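
      As a worked check on the 2000-framework runs above, the per-round timings can be averaged and compared. The following C++ sketch is illustrative only: the helper names `mean` and `speedup` are hypothetical and not part of Mesos, and the inputs are meant to be copied by hand from the two logs.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of per-round allocation times, in seconds.
// Hypothetical helper for this ticket's numbers; not Mesos code.
double mean(const std::vector<double>& rounds) {
  assert(!rounds.empty());
  return std::accumulate(rounds.begin(), rounds.end(), 0.0) / rounds.size();
}

// Ratio of the mean time before the patches to the mean time after them.
double speedup(const std::vector<double>& before,
               const std::vector<double>& after) {
  return mean(before) / mean(after);
}
```

      Called with the 2000-framework timings ({28.209454, 28.469419, 28.138086} before and {12.563203, 12.437517, 12.470708} after), the mean drops from about 28.27 s to about 12.49 s, a speedup of about 2.26x. The 200-framework runs are left out because the logs report different offer counts before and after (200 vs. 2000), so they are not directly comparable.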
      

      The patches do three things to improve the performance of the allocator:

      1) The total values in the DRFSorter are precalculated per resource type.

      2) In the allocate method, when no resources are available to allocate, we break out of the innermost loop, which avoids looping over a large number of frameworks when there is nothing to allocate.

      3) When a framework suppresses offers, we remove it from the sorter instead of just calling continue in the allocation loop. This greatly improves performance in the sorter and avoids looping over frameworks that don't need resources.
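
      Points 1) and 2) can be sketched in a few lines of C++. This is a toy model under stated assumptions, not the code from the patches: `Resources`, `ResourcePool`, `isEmpty` and `allocateAgent` are hypothetical names, resources are reduced to named scalars, and each offer hands out everything left on an agent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a resource vector: resource name -> scalar amount.
using Resources = std::map<std::string, double>;

// Toy pool of agent resources that maintains a running total per resource
// type (point 1), so a share computation never re-sums over all agents.
class ResourcePool {
 public:
  void addAgent(const std::string& agentId, const Resources& resources) {
    assert(agents_.count(agentId) == 0);
    agents_[agentId] = resources;
    for (const auto& entry : resources) totals_[entry.first] += entry.second;
  }

  void removeAgent(const std::string& agentId) {
    auto it = agents_.find(agentId);
    if (it == agents_.end()) return;
    for (const auto& entry : it->second) totals_[entry.first] -= entry.second;
    agents_.erase(it);
  }

  // Largest fraction of any resource type held by `allocation`. The cost
  // depends on the number of resource types, not on the number of agents.
  double dominantShare(const Resources& allocation) const {
    double share = 0.0;
    for (const auto& entry : allocation) {
      auto total = totals_.find(entry.first);
      if (total != totals_.end() && total->second > 0.0) {
        share = std::max(share, entry.second / total->second);
      }
    }
    return share;
  }

 private:
  std::map<std::string, Resources> agents_;
  Resources totals_;  // Precalculated per resource type.
};

// True when no resource type has a positive amount left.
bool isEmpty(const Resources& resources) {
  for (const auto& entry : resources) {
    if (entry.second > 0.0) return false;
  }
  return true;
}

// Walks the frameworks in sorted order for one agent and returns how many
// were visited. Once nothing is left, the loop breaks (point 2) instead of
// visiting every remaining framework.
std::size_t allocateAgent(Resources& available,
                          const std::vector<std::string>& sortedFrameworks,
                          std::map<std::string, Resources>& offers) {
  std::size_t visited = 0;
  for (const std::string& framework : sortedFrameworks) {
    if (isEmpty(available)) break;
    ++visited;
    offers[framework] = available;  // Toy assumption: offer everything left.
    available.clear();
  }
  return visited;
}
```

      In this model, an agent with free resources and 2000 frameworks in the sorted list visits one framework rather than 2000, and an agent with nothing left visits none.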

      Assuming that most frameworks behave nicely and suppress offers when they have nothing to schedule, it is fair to assume that point 3) has the biggest impact on performance. If we suppress offers for 90% of the frameworks in the benchmark test, we see the following numbers:

      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
      [ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
      Using 200 slaves and 2000 frameworks
      round 0 allocate took 11626us to make 200 offers
      round 1 allocate took 22890us to make 200 offers
      round 2 allocate took 21346us to make 200 offers
      

      And for 200 frameworks:

      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
      [ RUN      ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
      Using 2000 slaves and 2000 frameworks
      round 0 allocate took 1.11178secs to make 2000 offers
      round 1 allocate took 1.062649secs to make 2000 offers
      round 2 allocate took 1.080181secs to make 2000 offers
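
      The effect of point 3) can be illustrated with a toy sorter in which a suppressed framework leaves the ordered set entirely, so sorting and the allocation loop only touch frameworks that still want offers. This is a minimal sketch under assumptions: `ToySorter` and its methods are named for illustration, shares are plain numbers supplied by the caller, and none of this is the code from the review requests.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy sorter that orders active clients by share (lowest first).
class ToySorter {
 public:
  // Registers a new client with a fixed share and marks it active.
  void add(const std::string& client, double share) {
    assert(shares_.count(client) == 0);
    shares_[client] = share;
    active_.insert(std::make_pair(share, client));
  }

  // Called when a framework suppresses offers: the client is dropped from
  // the ordered set, so later sort() calls never see it.
  void deactivate(const std::string& client) {
    auto it = shares_.find(client);
    if (it != shares_.end()) {
      active_.erase(std::make_pair(it->second, client));
    }
  }

  // Called when a framework revives offers: the client is put back.
  void activate(const std::string& client) {
    auto it = shares_.find(client);
    if (it != shares_.end()) {
      active_.insert(std::make_pair(it->second, client));
    }
  }

  // Returns only the active clients, in ascending share order.
  std::vector<std::string> sort() const {
    std::vector<std::string> result;
    for (const auto& entry : active_) result.push_back(entry.second);
    return result;
  }

 private:
  std::map<std::string, double> shares_;
  std::set<std::pair<double, std::string>> active_;
};
```

      With 2000 clients added and 90% of them deactivated, `sort()` returns 200 entries, which matches the 200 offers per round in the 90%-suppressed run above.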
      

      Review requests:

      https://reviews.apache.org/r/43665/
      https://reviews.apache.org/r/43666/
      https://reviews.apache.org/r/43668/

People

    • Assignee: Dario Rexin (drexin)
    • Reporter: Dario Rexin (drexin)