Currently the getBestClustering algorithm for RAS finds the "Best" cluster/rack based on which rack has the most available resources this may be insufficient and may cause topologies not to be able to be scheduled successfully even though there are enough resources to schedule it in the cluster. We attempt to find the rack with the most resources by find the rack with the biggest sum of available memory + available cpu. This method is not effective since it does not consider the number of slots available. This method also fails in identifying racks that are not schedulable due to the exhaustion of one of the resources either memory, cpu, or slots. The current implementation also tries the initial scheduling on one rack and not try to schedule on all the racks before giving up which may cause topologies to be failed to be scheduled due to the above mentioned shortcomings in the current method. Also the current method does not consider failures of workers. When executors of a topology gets unassigned and needs to be scheduled again, the current logic in getBestClustering may be inadequate if not complete wrong. When executors needs to rescheduled due to a fault, getBestClustering will likely return a cluster that is different from where the majority of executors from the topology is originally scheduling in.
Thus, I propose a different strategy/algorithm to find the "best" cluster. I have come up with a ordering strategy I dub subordinate resource availability ordering (inspired by Dominant Resource Fairness) that sorts racks by the subordinate (not dominant) resource availability.
For example given 4 racks with the following resource availabilities
It is clear that rack-0 is the best cluster since its the most balanced and can potentially schedule the most executors, while rack-2 is the worst rack since rack-2 is depleted of cpu resource thus rendering it unschedulable even though there are other resources available.
We first calculate the resource availability percentage of all the racks for each resource by computing:
We do this calculation to normalize the values otherwise the resource values would not be comparable.
So for our example:
The effective resource of a rack, which is also the subordinate resource, is computed by:
Then we order the racks by the effective resource.
Thus for our example:
Also to deal with the presence of failures, if a topology is partially scheduled, we find the rack with the most scheduled executors for the topology and we try to schedule on that rack first.
Thus for the sorting for racks. We first sort by the number of executors already scheduled on the rack and then by the subordinate resource availability.