Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-1418

Automatic, online load balancing

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Normal
    • Resolution: Won't Fix
    • None
    • None

    Description

      Goal

      CASSANDRA-192 began with the intention of implementing full cluster load balancing, but ended up being (wisely) limited to a manual load balancing operation. This issue is an umbrella ticket for finishing the job of implementing automatic, always-on load balancing.

      It is possible to implement very efficient load balancing operations with a single process directing the rebalancing of all nodes, but avoiding such a central process and allowing individual nodes to make their own movement decisions would be ideal.

      Components

      Optimal movements for individual nodes

      Ruhl

      One such approach is the Ruhl algorithm described on 192: https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079 . But as described, it performs excessive movement for large hotspots, and can take a long time to reach equilibrium. Consider the following ring:

      token load
      a 5
      c 5
      e 5
      f 40
      k 5

      Assuming that node 'a' is the first to discover that 'f' is overloaded: it will apply Case 2, and assume half of 'f's load by moving to 'i', leaving both with 20 units. But this is not a optimal movement, because both 'f' and 'a/i' will still be holding data that they will need to give away. Additionally, 'a/i' can't begin giving the data away until it has finished receiving it.

      If node 'e' is the first to discover that 'f' is overloaded, it will apply Case 1, and 'f' will give half of its load to 'e' by moving to 'i'. Again, this is a non-optimal movement, because it will result in both 'e' and 'f/i' holding data that they need to give away.

      Adding load awareness to Ruhl

      Luckily, there appears to be a simple adjustment to the Ruhl algorithm that solves this problem by taking advantage of the fact that Cassandra knows the total load of a cluster, and can use it to calculate the average/ideal load ω. Once node j has decided it should take load from node i (based on the ε value in Ruhl), rather than node j taking 1/2 of the load on node i, it should chose a token such that either i or j ends up with a load within ε*ω of ω.

      Again considering the ring described above, and assuming ε == 1.0, the total load for the 5 nodes is 60, giving a ω of 12. If node 'a' is the first to discover 'f', it will choose to move to 'j' (a token that takes 12 or ω load units from 'f'), leaving 'f' with a load of 28. When combined with the improvement in the next section, this is closer to being an optimal movement, because 'a/j' will at worst have ε*ω of load to give away, and 'f' is in a position to start more movements.

      Automatic load balancing

      Since the Ruhl algorithm only requires a node to make a decision based on itself and one other node, it should be relatively straightforward to add a timer on each node that periodically wakes up and executes the modifiied Ruhl algorithm if it is not already in the process of moving (based on pending ranges).

      Automatic balancing should probably be enabled by default, and should have a configurable per-node bandwidth cap.

      Allowing concurrent moves on a node

      Allowing a node to give away multiple ranges at once allows for the type of quick balancing that is typically only attributed to vnodes. If a node is a hotspot, such as in the example above, the node should be able to quickly dump the load in a manner that causes minimal load on the rest of the cluster. Rather than transferring to 1 target at 10 MB/s, a hotspot can give to 5 targets at 2 MB/s each.

      Attachments

        Activity

          People

            Unassigned Unassigned
            stuhood Stu Hood
            Votes:
            3 Vote for this issue
            Watchers:
            14 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 42h
                42h
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 42h
                42h