In our environment, where there are 1.5k frameworks and quota is heavily utilized, we experience a severe resource fragmentation issue. Specifically, we observe a large number of port-less offers circulating in the cluster. As a result, frameworks that need port resources are not able to launch tasks even if their roles have quota (because currently we can only set quota for scalar resources, not for port range resources).
Most of the 1.5k frameworks do not suppress offers today, and we believe the situation will improve significantly once they do. Still, I think there are some improvements the Mesos allocator can make to help.
These port-less offers originate from quota chopping. Specifically, when chopping an agent to satisfy a role’s quota, we also hand out resources that this role does not have quota for (as long as doing so does not break other roles’ quota). These “extra resources” include ALL the remaining port resources on the agent. After this offer, the agent is left with no port resources even though it still has CPUs and other resources. Later, those remaining resources may be offered to other frameworks, but they are useless without ports. Now we have some “bad offers” in the cluster.
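The effect can be illustrated with a toy model. This is only a sketch, not the allocator's actual code: `chop_for_quota` is a hypothetical helper, ports are modeled as a plain count rather than ranges, and the check against other roles' quota is left out.

```python
def chop_for_quota(agent, quota):
    """Toy model of quota chopping on a single agent (hypothetical helper).

    `agent` maps resource name -> available quantity (ports as a count).
    `quota` maps resource name -> the role's unsatisfied quota.
    Returns (offer, remaining).
    """
    offer, remaining = {}, {}
    for name, available in agent.items():
        if name in quota:
            # Quota'd resources are chopped down to what the role still needs.
            granted = min(available, quota[name])
        else:
            # Resources without quota (e.g. ports) are handed out in full.
            granted = available
        offer[name] = granted
        remaining[name] = available - granted
    return offer, remaining


agent = {"cpus": 16, "mem": 65536, "ports": 1000}
offer, remaining = chop_for_quota(agent, {"cpus": 4, "mem": 8192})
print(offer)      # {'cpus': 4, 'mem': 8192, 'ports': 1000}
print(remaining)  # {'cpus': 12, 'mem': 57344, 'ports': 0}
```

The remaining 12 CPUs are what later gets offered as a port-less “bad offer”.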
A resource offer, once declined (e.g. due to no ports), is recovered by the allocator and offered to other frameworks again. Before this happens, the offer may merge with either the remaining resources or other declined resources on the same agent. However, it is not uncommon for the declined offer to be handed out again as-is. This is especially likely if the allocator makes offers faster than frameworks respond to them. As a result, we observe bad offers circulating across different frameworks. These bad offers exist for a long time before being consolidated again. For how long? The longevity of a bad offer is roughly proportional to the number of active frameworks. In the worst case, only once all the active frameworks have declined the bad offer (hopefully with long refuse durations) does it have nowhere to go and finally start to merge with other resources on that agent.
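A rough simulation of this worst case is sketched below. The model is an assumption made for illustration: each allocation cycle hands the bad offer, as-is, to the first framework that is not currently filtering it, and that framework declines and filters the offer for `refuse_cycles` cycles. The function name and parameters are hypothetical.

```python
def cycles_until_consolidation(num_frameworks, refuse_cycles, max_cycles=100_000):
    """Count allocation cycles a bad offer circulates before it has nowhere to go.

    Returns None if the offer is still circulating after `max_cycles`.
    """
    # Cycle at which each framework's decline filter expires.
    filtered_until = [0] * num_frameworks
    for cycle in range(max_cycles):
        candidates = [i for i, until in enumerate(filtered_until) if until <= cycle]
        if not candidates:
            # Every active framework is filtering the offer: it can now
            # stay in the allocator and merge with other resources.
            return cycle
        # The chosen framework declines and filters the offer for a while.
        filtered_until[candidates[0]] = cycle + 1 + refuse_cycles
    return None


for n in (10, 100, 1000):
    print(n, cycles_until_consolidation(n, refuse_cycles=10_000))
```

Under these assumptions the circulation time grows linearly with the number of active frameworks, and with short refuse durations the offer may never be consolidated at all (the function returns `None`).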
Note that since allocator performance has greatly improved in the past several months, the scenario described here could become increasingly common. Also, as we introduce quota limits and hierarchical quota, there will be much more agent chopping, making resource fragmentation even worse.
As mentioned above, the longevity of a bad offer is proportional to the number of active frameworks, so framework suppression will certainly help. In addition, on the Mesos side, a few mitigation measures are worth considering (other than the long-term optimistic allocation strategy):
1. Add a periodic defragmentation interval to the allocator. For example, every minute or every dozen or so allocation cycles, we pause allocation, rescind all the offers and start allocating again. This essentially eliminates all the circulating bad offers by giving them a chance to be consolidated. Think of this as a periodic “reboot” of the allocator.
2. Consider chopping non-quota resources as well. Right now, resources such as ports (or any other resources that the role does not have quota for) are all allocated in a single offer. We could choose to chop these non-quota resources as well. For example, port resources could be distributed in proportion to the allocated CPU resources.
3. Provide support for specifying port quantities. With this, we can utilize the existing quota or `min_allocatable_resources` APIs to guarantee a certain number of port resources.
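As an illustration of option 2, proportional port chopping could look roughly like the sketch below. The helper names are hypothetical, port ranges are modeled as inclusive `(begin, end)` tuples, and `total_cpus` is assumed to be the CPUs available on the agent at the time of chopping.

```python
def take_ports(ranges, count):
    """Take `count` ports from the front of inclusive (begin, end) ranges.

    Returns (taken, rest), both as lists of inclusive ranges.
    """
    taken, rest = [], []
    for begin, end in ranges:
        size = end - begin + 1
        if count >= size:
            # The whole range fits into the share.
            taken.append((begin, end))
            count -= size
        elif count > 0:
            # Split the range: the front part is offered, the rest stays.
            taken.append((begin, begin + count - 1))
            rest.append((begin + count, end))
            count = 0
        else:
            rest.append((begin, end))
    return taken, rest


def chop_ports(ranges, offered_cpus, total_cpus):
    """Offer ports in proportion to the share of CPUs being offered."""
    total_ports = sum(end - begin + 1 for begin, end in ranges)
    share = total_ports * offered_cpus // total_cpus
    return take_ports(ranges, share)


offered, left = chop_ports([(31000, 31999)], offered_cpus=4, total_cpus=16)
print(offered)  # [(31000, 31249)]
print(left)     # [(31250, 31999)]
```

With a split like this, the resources left on the agent still carry ports, so later offers from that agent remain usable.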