Resolution: Won't Fix
Fix Version/s: None
CASSANDRA-192 began with the intention of implementing full cluster load balancing, but ended up being (wisely) limited to a manual load balancing operation. This issue is an umbrella ticket for finishing the job of implementing automatic, always-on load balancing.
It is possible to implement very efficient load balancing operations with a single process directing the rebalancing of all nodes, but avoiding such a central process and allowing individual nodes to make their own movement decisions would be ideal.
One such approach is the Ruhl algorithm described on 192: https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079 . But as described, it performs excessive movement for large hotspots, and can take a long time to reach equilibrium. Consider the following ring:
Assuming that node 'a' is the first to discover that 'f' is overloaded: it will apply Case 2, and assume half of 'f's load by moving to 'i', leaving both with 20 units. But this is not a optimal movement, because both 'f' and 'a/i' will still be holding data that they will need to give away. Additionally, 'a/i' can't begin giving the data away until it has finished receiving it.
If node 'e' is the first to discover that 'f' is overloaded, it will apply Case 1, and 'f' will give half of its load to 'e' by moving to 'i'. Again, this is a non-optimal movement, because it will result in both 'e' and 'f/i' holding data that they need to give away.
Luckily, there appears to be a simple adjustment to the Ruhl algorithm that solves this problem by taking advantage of the fact that Cassandra knows the total load of a cluster, and can use it to calculate the average/ideal load ω. Once node j has decided it should take load from node i (based on the ε value in Ruhl), rather than node j taking 1/2 of the load on node i, it should chose a token such that either i or j ends up with a load within ε*ω of ω.
Again considering the ring described above, and assuming ε == 1.0, the total load for the 5 nodes is 60, giving a ω of 12. If node 'a' is the first to discover 'f', it will choose to move to 'j' (a token that takes 12 or ω load units from 'f'), leaving 'f' with a load of 28. When combined with the improvement in the next section, this is closer to being an optimal movement, because 'a/j' will at worst have ε*ω of load to give away, and 'f' is in a position to start more movements.
Since the Ruhl algorithm only requires a node to make a decision based on itself and one other node, it should be relatively straightforward to add a timer on each node that periodically wakes up and executes the modifiied Ruhl algorithm if it is not already in the process of moving (based on pending ranges).
Automatic balancing should probably be enabled by default, and should have a configurable per-node bandwidth cap.
Allowing a node to give away multiple ranges at once allows for the type of quick balancing that is typically only attributed to vnodes. If a node is a hotspot, such as in the example above, the node should be able to quickly dump the load in a manner that causes minimal load on the rest of the cluster. Rather than transferring to 1 target at 10 MB/s, a hotspot can give to 5 targets at 2 MB/s each.