MESOS-9806

Address allocator performance regression due to the addition of quota limits.


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9.0
    • Component/s: allocation
    • Target Version/s:
    • Sprint:
      Resource Mgmt: RI-17 Sprint 53
    • Story Points:
      5

      Description

      In MESOS-9802, we removed the quota role sorter, which was tech debt.

      However, this slows down the allocator. In the first stage, the allocator now has to sort and iterate over every role in the cluster, even when no active role has a non-default quota. Benchmark results show that with 1k roles and 2k frameworks, the allocator can suffer a ~50% performance degradation.

      There are a couple of ways to address this issue. For example, we could make the sorter aware of quota and add a method, say `sortQuotaRoles`, that returns all the roles with non-default quota. Alternatively, an even better approach would be to deprecate the sorter concept and just have two standalone functions, e.g. `sortRoles()` and `sortQuotaRoles()`, that take in the role tree structure (which does not yet exist in the allocator) and return the sorted roles.
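The second option can be sketched as follows. This is only a minimal illustration, not the allocator's actual code: the `Role` struct, its `share` sort key, the `hasNonDefaultQuota` flag, and the flat `std::vector` standing in for the role tree are all assumptions made for the example. Only the function names `sortRoles` and `sortQuotaRoles` come from the description above.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a node of the role tree (flattened to a list here).
struct Role {
  std::string name;
  double share;             // Placeholder sort key, e.g. a DRF-style share.
  bool hasNonDefaultQuota;  // True if a non-default quota is set for the role.
};

// Orders roles by ascending share; ties are broken by name so that the
// resulting order is deterministic.
static void sortByShare(std::vector<Role>& roles) {
  std::sort(roles.begin(), roles.end(), [](const Role& a, const Role& b) {
    if (a.share != b.share) {
      return a.share < b.share;
    }
    return a.name < b.name;
  });
}

// Returns every role, sorted.
std::vector<Role> sortRoles(std::vector<Role> roles) {
  sortByShare(roles);
  return roles;
}

// Returns only the roles with non-default quota, sorted. Filtering before
// sorting means the quota stage never sorts or visits roles without quota.
std::vector<Role> sortQuotaRoles(const std::vector<Role>& roles) {
  std::vector<Role> quotaRoles;
  std::copy_if(
      roles.begin(), roles.end(), std::back_inserter(quotaRoles),
      [](const Role& role) { return role.hasNonDefaultQuota; });
  sortByShare(quotaRoles);
  return quotaRoles;
}
```

With this shape, a cluster in which no role has a non-default quota gets an empty result from `sortQuotaRoles`, so the first stage has nothing to iterate over.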

      In addition, implementing MESOS-8068 requires doing more work during the allocation cycle. In particular, we need to call `shrink` many more times than before. All of this contributes to the performance slowdown. Specifically, for the quota-oriented benchmark `HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2`, we observe a 2-3x slowdown compared to the previous release (1.8.1):

      Current master:

      QuotaParam/BENCHMARK_HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
      Benchmark setup: 3000 agents, 3000 roles, 3000 frameworks, with drf sorter
      Made 3500 allocations in 32.051382735secs
      Made 0 allocation in 27.976022773secs

      1.8.1:
      HierarchicalAllocator_WithQuotaParam.LargeAndSmallQuota/2
      Made 3500 allocations in 13.810811063secs
      Made 0 allocation in 9.885972984secs
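As a quick check on the "2-3x" figure, the slowdown factors can be computed directly from the timings above. The constant and function names below are made up for the example; the numbers are copied from the two benchmark runs. The ratios come out to roughly 2.3x for the run that made 3500 allocations and roughly 2.8x for the run that made none.

```cpp
// Timings in seconds, copied from the benchmark output above.
const double kMasterWithAllocations = 32.051382735;  // Current master, 3500 allocations.
const double kMasterNoAllocations = 27.976022773;    // Current master, 0 allocations.
const double k181WithAllocations = 13.810811063;     // 1.8.1, 3500 allocations.
const double k181NoAllocations = 9.885972984;        // 1.8.1, 0 allocations.

// Ratio of the newer timing to the baseline; values above 1 mean a slowdown.
double slowdown(double currentSecs, double baselineSecs) {
  return currentSecs / baselineSecs;
}
```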

            People

            • Assignee: Meng Zhu (mzhu)
            • Reporter: Meng Zhu (mzhu)