CASSANDRA-3833: support arbitrary topology transitions

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      Once we have the locator abstracted (with the gossiper being a
      particular concrete implementation), we want to change the locator
      abstraction so that it no longer expresses changes in ring topology on
      a per-node basis; rather, we want an abstraction which communicates two
      arbitrary ring states: one state for the read set, and one for the
      write set.

      Once this abstraction is in place, the (pluggable) locator will be
      able to make bulk changes to the ring at once (a sketch of such an
      interface follows the list below). Main points:

      • Consistency level must be handled carefully during ring
        transitions, so that a given node in the read set corresponds to a
        specific node in the write set. To avoid code complexity, this will
        impose some restrictions on how transitions may complete, so it is
        an important point.
      • All code outside of gossip (and any other locator that works
        similarly) will be agnostic about changes to individual nodes, and
        will instead only be notified when new ring states are available (in
        aggregate). This makes the change non-trivial, because all code that
        currently assumes individual node changes always produce a valid
        ring will have to be changed.

          Activity

          Peter Schuller added a comment -

          CASSANDRA-3901 explains the basics of the first of the two points listed above, and why I believe this is already incorrect in the current version.

          It gets more complicated for arbitrary transitions for the following reason:

          The easiest way to implement arbitrary transitions would be to require that a transition completes fully, or not at all. This avoids complexity in responsibility calculations: each node is responsible (in the write set) for the parts of the ring that it will eventually be responsible for once the transition completes.

          But clearly, from an operator's standpoint, it would be good to allow ad-hoc adjustment of a transition in progress. Suppose you are bootstrapping 100 nodes into a cluster and 2 of them turn out to be broken; you would like to be able to say "oh well, never mind those 2 for now, I'll come back to them later [when the h/w is fixed, for example]". The problem is that if there is overlap between hosts being inserted into the cluster (I'm using the word "inserted" and assuming node bootstrap for simplicity; the equivalent holds true for any change), other nodes will not have been part of the write set, so you cannot just forget about the ones that aren't up yet.

          One way to address this is to not consider overlapping nodes when calculating the write set, preferring to write "too much" data (as is the case today). Another way is to do the full calculations and have additional streaming happen when a topology change is adjusted - but that seems excessively complex.
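
          To make the first option concrete, here is a minimal sketch,
          assuming overlap is handled by simply writing to the union of
          current and pending owners; the class and method names are
          hypothetical.

              import java.net.InetAddress;
              import java.util.LinkedHashSet;
              import java.util.Set;

              final class WriteSets {
                  // The write set is the union of the read-set endpoints and
                  // every pending future owner; overlapping nodes need no
                  // special treatment because we write "too much" to everyone.
                  static Set<InetAddress> writeSet(Set<InetAddress> currentOwners,
                                                   Set<InetAddress> pendingOwners) {
                      Set<InetAddress> result = new LinkedHashSet<>(currentOwners);
                      result.addAll(pendingOwners);
                      return result;
                  }
              }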

          Yet a third way is to specifically support the concept of a node which "was supposed to be at token X and this other node Y was bootstrapped with that in mind". This is similar to what was discussed in CASSANDRA-3483 and can get complex.

          Peter Schuller added a comment -

          I forgot another thing: if one goes for the option of writing to all possible endpoints that might come up, it also significantly increases the probability of Unavailable exceptions. Consider that you are streaming data in a replica set to 3 different nodes that have overlapping future ownership (pending ranges). If you need to allow each node to go up individually, not only do you need to write more, as mentioned above; every single request that touches data shared between one existing host in the replica set and multiple potential future owners would have to be acknowledged by all of them for consistency purposes.

          In other words, whenever you are performing a write, you essentially have to keep track of the current read set, the write set, and the possible "future equivalence" between the two. If data is moving from A to B, with A being in the read set and B being in the write set (read set = A, write set = AB, using RF=1 for simplicity), any write satisfied with the help of an ACK from A must also be satisfied by B. Otherwise the data is effectively lost once B finishes the join.
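
          A hedged sketch of that bookkeeping follows; the tracker below and
          all of its names are hypothetical, not Cassandra internals. The
          idea: an ack from a read-set node only counts toward the
          consistency level once every write-set node that is its "future
          equivalent" has also acked.

              import java.net.InetAddress;
              import java.util.HashSet;
              import java.util.Map;
              import java.util.Set;

              final class TransitionAwareAckTracker {
                  // Read-set node -> write-set nodes that must mirror its acks.
                  private final Map<InetAddress, Set<InetAddress>> futureEquivalents;
                  private final Set<InetAddress> acked = new HashSet<>();

                  TransitionAwareAckTracker(Map<InetAddress, Set<InetAddress>> futureEquivalents) {
                      this.futureEquivalents = futureEquivalents;
                  }

                  void onAck(InetAddress from) {
                      acked.add(from);
                  }

                  // A read-set node contributes to CL only if all of its
                  // future equivalents have acked as well.
                  boolean countsTowardCL(InetAddress readSetNode) {
                      return acked.contains(readSetNode)
                          && acked.containsAll(
                                 futureEquivalents.getOrDefault(readSetNode, Set.of()));
                  }
              }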

          For arbitrary ring changes the number of overlaps could potentially be higher than 1. For example, if you have RF=3 and triple the size of the cluster, you are now inserting two nodes in between each pre-existing node. So for any replica set for a row key with read set ABC, the future read set will be AXY (X and Y being bootstrapping nodes), and while they are joining, the write set will be AXYBC. But any write satisfying CL via A must also be acked by X and Y, because we don't know which of them will finish bootstrapping first.
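
          The tripling scenario plays out like this against the hypothetical
          tracker sketched above (the addresses are made up for
          illustration):

              import java.net.InetAddress;
              import java.util.Map;
              import java.util.Set;

              class TriplingExample {
                  public static void main(String[] args) throws Exception {
                      InetAddress a = InetAddress.getByName("10.0.0.1"); // A, existing
                      InetAddress x = InetAddress.getByName("10.0.0.4"); // X, joining
                      InetAddress y = InetAddress.getByName("10.0.0.5"); // Y, joining

                      // A's range is being split between X and Y; since we don't
                      // know which will finish bootstrapping first, an ack from A
                      // only counts once both X and Y have acked.
                      TransitionAwareAckTracker tracker =
                          new TransitionAwareAckTracker(Map.of(a, Set.of(x, y)));

                      tracker.onAck(a);
                      assert !tracker.countsTowardCL(a); // X and Y have not acked yet

                      tracker.onAck(x);
                      tracker.onAck(y);
                      assert tracker.countsTowardCL(a);  // now A's ack is safe to count
                  }
              }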

          Peter Schuller added a comment -

          It should be noted, though, that the "we don't know who goes up first" problem is only a result of using a token ring as the underlying abstraction. One either has to add the logic to work around this when reasoning about the token ring, or alternatively one can move away from the token ring as the underlying abstraction (which can in turn be thought of as mapping ring segments; the token ring then becomes more of an implementation detail than the fundamental model).

          More on this in the future when this ticket is closer to reality.


            People

            • Assignee:
              Peter Schuller
            • Reporter:
              Peter Schuller
            • Votes:
              0
            • Watchers:
              1
