Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None

      Description

      Currently, we advise users to "wait two minutes between adding new nodes" so that a new node's tokens, among other state, can propagate. Further, as there is no coordination amongst joining nodes with respect to token selection, new nodes can end up selecting ranges that overlap with those of other joining nodes. This causes a lot of duplicate streaming from the existing source nodes as they shovel out the bootstrap data for those new nodes.

      This ticket proposes creating a mechanism that allows strongly consistent membership and ownership changes in Cassandra, such that changes are performed in a linearizable and safe manner. The basic idea is to use LWT operations over a global system table, leveraging the linearizability of LWT to ensure the safety of cluster membership/ownership state changes. This work is inspired by Riak's claimant module.
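      As a rough illustration of that idea, a membership change could be expressed as a conditional write against a globally replicated table. The keyspace, table, and column names below are hypothetical placeholders (not a committed schema), and the DataStax Java driver is used purely to sketch the shape of the operation:

          import java.util.UUID;

          import com.datastax.driver.core.Cluster;
          import com.datastax.driver.core.ResultSet;
          import com.datastax.driver.core.Session;

          public class MembershipLwtSketch {
              public static void main(String[] args) {
                  // Hypothetical globally replicated membership table, e.g.:
                  //   CREATE TABLE cluster_metadata.members (
                  //       host_id uuid PRIMARY KEY,
                  //       state   text,      -- e.g. JOINING, NORMAL, LEAVING
                  //       tokens  set<text>  -- empty until ownership is granted
                  //   );
                  try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                       Session session = cluster.connect()) {

                      UUID hostId = UUID.randomUUID();

                      // Conditional insert: the Paxos round behind the LWT ensures the
                      // membership change is applied at most once, linearizably.
                      ResultSet rs = session.execute(
                          "INSERT INTO cluster_metadata.members (host_id, state) " +
                          "VALUES (?, 'JOINING') IF NOT EXISTS",
                          hostId);

                      boolean applied = rs.one().getBool("[applied]");
                      System.out.println("membership change applied: " + applied);
                  }
              }
          }

      The "[applied]" column returned by the conditional write tells the proposer whether its change won; a losing proposer re-reads the current state and retries, which is the linearizable behavior the membership change relies on.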

      The existing workflows for node join, decommission, remove, replace, and range move (there may be others I'm not thinking of) will need to be modified to participate in this scheme, and nodetool will need corresponding changes to enable them.

      Note: we distinguish between membership and ownership in the following ways: by membership we mean "a host in this cluster and its state". By ownership, we mean "what tokens (or ranges) each node owns"; a node must already be a member before it can be assigned tokens.

      A rough sketch of how the 'add new node' workflow might look: new nodes would no longer create tokens themselves, but would instead contact a member of a Paxos cohort (via a seed). The cohort member will generate the tokens and execute an LWT transaction, ensuring a linearizable change to the membership/ownership state. The updated state will then be disseminated via the existing gossip.

      As for joining specifically, I think we could support two modes: auto-mode and manual-mode. Auto-mode is for adding a single new node per LWT operation, and would require no operator intervention (much like today). In manual-mode, however, multiple new nodes could (somehow) signal their intent to join the cluster, but would wait until an operator executes a nodetool command that triggers the token generation and LWT operation for all pending new nodes. This allows better range partitioning and makes bootstrap streaming more efficient, since we avoid overlapping range requests.

        Issue Links

          Activity

           jasobrown Jason Brown added a comment - edited

           As a back-of-the-envelope example, the workflow for adding a new node could look something like this:

          auto-join mode (for a single node)

          1. new node comes up, contacts seed node for membership/ownership info
             • the new node needs to contact a seed because it does not have existing membership info (the seed does)
          2. seed will immediately return current membership and ownership info, but no tokens for the new node (see next steps)
          3. seed will generate tokens (as per new TokenAllocation class)
          4. execute LWT operation
          5. "broadcast" out the update to existing cluster nodes
            • existing gossip is probably sufficient for this
          6. seed sends tokens to new node, at which point the new node can start bootstrap streaming data from peers
             • an alternative could be to let the new node learn about its tokens via gossip (a seed-side sketch of steps 3-6 follows this list)
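
           A minimal seed-side sketch of steps 3-6 above. Everything here is a placeholder: the interfaces stand in for the proposed TokenAllocation class, the LWT/consensus operation, and existing gossip, and none of the names correspond to actual Cassandra classes:

               import java.util.Collection;
               import java.util.UUID;

               // Sketch only: a real implementation would live inside Cassandra and use
               // its internal metadata, messaging, and consensus machinery.
               public class SeedJoinHandler {

                   interface TokenAllocator {               // stand-in for the proposed TokenAllocation class
                       Collection<String> allocateFor(UUID hostId);
                   }
                   interface ConsensusLog {                 // stand-in for the LWT / consensus transaction
                       boolean proposeOwnershipChange(UUID hostId, Collection<String> tokens);
                   }
                   interface Gossiper {                     // stand-in for existing gossip dissemination
                       void broadcastMembershipUpdate();
                   }

                   private final TokenAllocator allocator;
                   private final ConsensusLog consensus;
                   private final Gossiper gossiper;

                   SeedJoinHandler(TokenAllocator a, ConsensusLog c, Gossiper g) {
                       this.allocator = a; this.consensus = c; this.gossiper = g;
                   }

                   /** Steps 3-6 of the auto-join workflow, run on the seed. */
                   Collection<String> handleJoinRequest(UUID newHostId) {
                       Collection<String> tokens = allocator.allocateFor(newHostId);   // step 3
                       while (!consensus.proposeOwnershipChange(newHostId, tokens)) {  // step 4, retry on contention
                           tokens = allocator.allocateFor(newHostId);
                       }
                       gossiper.broadcastMembershipUpdate();                           // step 5
                       return tokens;                                                  // step 6: returned so the node can bootstrap
                   }
               }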

          manual join (can be done for one or multiple nodes)

           1. new node(s) come up and contact seed(s) to signal that they are joining but want to be part of a single transaction
          2. seed will immediately return current membership and ownership info, but no tokens for the new node (see next steps)
           3. operator executes "nodetool joinall" on any existing node. Optionally, the operator can pass in explicit IP addresses to be added
            1. seed will generate tokens (as per new TokenAllocation class)
            2. execute LWT operation
            3. "broadcast" out the update to existing cluster nodes
           4. seed sends tokens to the new node(s), at which point they can start bootstrap streaming data from peers (a batch-allocation sketch follows this list)
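
           A companion sketch for the batch path triggered by 'nodetool joinall', using the same hypothetical interfaces as the auto-join sketch above. The point of the batch is that tokens for all pending nodes are allocated together and committed in a single consensus transaction, so the allocator can avoid handing overlapping ranges to the joining nodes:

               import java.util.Collection;
               import java.util.Map;
               import java.util.UUID;

               // Sketch only; none of these names correspond to actual Cassandra classes.
               public class JoinAllHandler {

                   interface BatchTokenAllocator {
                       // Allocating for the whole pending set in one pass lets the allocator
                       // spread ranges without overlap between the joining nodes.
                       Map<UUID, Collection<String>> allocateFor(Collection<UUID> pendingHosts);
                   }
                   interface ConsensusLog {
                       boolean proposeOwnershipChanges(Map<UUID, Collection<String>> assignments);
                   }
                   interface Gossiper {
                       void broadcastMembershipUpdate();
                   }

                   private final BatchTokenAllocator allocator;
                   private final ConsensusLog consensus;
                   private final Gossiper gossiper;

                   JoinAllHandler(BatchTokenAllocator a, ConsensusLog c, Gossiper g) {
                       this.allocator = a; this.consensus = c; this.gossiper = g;
                   }

                   /** Triggered by the operator's 'nodetool joinall'. */
                   Map<UUID, Collection<String>> joinAll(Collection<UUID> pendingHosts) {
                       Map<UUID, Collection<String>> assignments = allocator.allocateFor(pendingHosts);
                       while (!consensus.proposeOwnershipChanges(assignments)) {  // one LWT for the whole batch
                           assignments = allocator.allocateFor(pendingHosts);     // re-plan and retry on contention
                       }
                       gossiper.broadcastMembershipUpdate();
                       return assignments;                                        // tokens sent to each pending node
                   }
               }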
           tjake T Jake Luciani added a comment - edited

          This is a big step forward and will be a big help for adding many nodes at once. Do you think this approach can be extended to allow more consistent schema changes like changing RF or altering a table?

          Also, for your manual join, what kind of information can we give to users to allow them to evaluate the pending transaction? For example, if a user is going to bootstrap a single node on a new rack it will become the replica for all ranges in the other rack. It would be nice to see the estimated data to be sent between nodes etc.

           jasobrown Jason Brown added a comment

          Do you think this approach can be extended to allow more consistent schema changes like changing RF or altering a table?

          I think that would be more of a function of the underlying paxos/LWT/consensus alg (which may or may not be the existing LWT, still considering and debating), more so than the overall membership changes. But I would hope the consensus alg work here would apply to other efforts, as well!

          Also, for your manual join, what kind of information can we give to users to allow them to evaluate the pending transaction?

           Initially I was only thinking of showing the minimal info: IP addr (or other host info) and possibly any token info (such as whether the node is replacing another, or the operator is explicitly setting tokens). That being said, we could display any amount of info we choose - the initial set was only bounded by my imagination. However, I really do like your idea about being able to determine the amount of data to be streamed to the new node - something like that should be a reasonably simple calculation and certainly helpful for operators.
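
           As a very rough sketch of how that estimate could be computed (purely illustrative, all names hypothetical): for each range the pending node would gain, charge a share of the current source replica's on-disk load proportional to the fraction of that replica's token space the range covers:

               import java.util.Map;

               // Illustrative only: estimates bytes to stream to a pending node.
               public class StreamEstimate {

                   /** A range the pending node will gain, plus its current source replica. */
                   static final class PendingRange {
                       final String sourceHost;
                       final double fractionOfSourceOwnership;  // 0.0 .. 1.0 of the source's token space
                       PendingRange(String sourceHost, double fraction) {
                           this.sourceHost = sourceHost;
                           this.fractionOfSourceOwnership = fraction;
                       }
                   }

                   /** @param loadBytesByHost current on-disk load per source host (as reported by nodetool status). */
                   static long estimateBytesToStream(Iterable<PendingRange> pendingRanges,
                                                     Map<String, Long> loadBytesByHost) {
                       long total = 0;
                       for (PendingRange r : pendingRanges) {
                           long sourceLoad = loadBytesByHost.getOrDefault(r.sourceHost, 0L);
                           total += (long) (sourceLoad * r.fractionOfSourceOwnership);
                       }
                       return total;
                   }
               }

           This assumes data is evenly distributed across a replica's token space, so it is only an approximation, but it would give operators a ballpark figure before confirming the join.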

          Note: I'm still ironing out the protocol and transition points, but let me post the updates in a short while.

           jasobrown Jason Brown added a comment

          updated manual join protocol:

          We require two "consensus transactions"(1) to become a full token-owner in the cluster: one transaction to become a member, and another to take ownership of some tokens(2). So, the protocol would look like this:

           1. new node comes up and performs a consensus transaction to become a member. It will probably need to contact a seed to perform the transaction, and this should happen automatically, without operator intervention.
             1. transaction adds the new node to the member set, as well as incrementing a logical clock(3)
            2. commit the transaction (retry on failure)
             3. updated member set/logical clock is gossiped - perhaps also directly message the joining node (sending it the total member set and metadata).
            4. node is allowed to participate in gossip after becoming a member.
          2. Repeat for each new node to be added to the cluster.
           3. operator can invoke 'nodetool show-pending-nodes' (or whatever we call the command) to see a plan of the current nodes waiting to be assigned tokens (and become part of the ring). Operator can confirm that everything looks as expected.
          4. Operator invokes 'nodetool joinall' to start the next consensus transaction, to make new nodes owners of tokens:
             1. proposer of the transaction auto-generates tokens for any node that did not declare its own (as, for example, a replacing node would have),
            2. updates the member set with tokens for each node, and increments the logical clock
            3. commit the transaction (retry on failure)
             4. updated member set/logical clock is gossiped - perhaps also directly message the transitioning nodes.
            5. transitioning nodes can change their status themselves, and start bootstrapping (if necessary)

           Note: if we generate tokens in the ownership step (as I think we should, for optimizing the token selection), then we cannot show the 'pending ranges/size to transfer' info that T Jake Luciani proposed in the 'nodetool show-pending-nodes' command output (as we don't know all the nodes/tokens yet). However, we might be able to get away with displaying it after the ownership consensus transaction is complete. Alternatively, we could run the same alg that generates the tokens to 'predict' what the tokens will look like, and display that as a potentially non-exact (but close enough) approximation of what the cluster will look like before executing the transaction.

           1) I'm not completely sure about using LWT for the necessary linearizable operation (for safe cluster state transitions), so I'll use the stand-in term "consensus transaction" for now.

           2) Tokens may be assigned manually by an operator, derived from another node that is being replaced, or auto-generated by a proposer of the consensus transaction (think proposer in Paxos).

           3) Logical clock - a monotonic counter that indicates the linearized changes to the member set (either adding/removing nodes, or changes to tokens of a member). Basically, an integer that is incremented on every modification.
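
           To make footnote 3 concrete, the logical clock maps naturally onto a compare-and-set over a single row. The keyspace, table, and column names below are hypothetical, and the DataStax Java driver is used only to sketch the shape of the conditional update:

               import com.datastax.driver.core.Cluster;
               import com.datastax.driver.core.ResultSet;
               import com.datastax.driver.core.Session;

               public class LogicalClockCasSketch {
                   public static void main(String[] args) {
                       // Hypothetical single-row table holding the member set and logical clock:
                       //   CREATE TABLE cluster_metadata.member_set (
                       //       key           text PRIMARY KEY,   -- always 'members'
                       //       logical_clock bigint,
                       //       members       text                -- serialized member set
                       //   );
                       try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                            Session session = cluster.connect()) {

                           long observedClock = 41L;   // clock value the proposer read before building its change
                           String newMembers = "...";  // serialized member set with the new node added

                           // Compare-and-set: the update applies only if no other proposer has
                           // advanced the clock since we read it; otherwise re-read and retry.
                           ResultSet rs = session.execute(
                               "UPDATE cluster_metadata.member_set " +
                               "SET members = ?, logical_clock = ? " +
                               "WHERE key = 'members' IF logical_clock = ?",
                               newMembers, observedClock + 1, observedClock);

                           boolean applied = rs.one().getBool("[applied]");
                           System.out.println(applied ? "change committed" : "lost the race; re-read and retry");
                       }
                   }
               }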


            People

            • Assignee:
              Unassigned
              Reporter:
              jasobrown Jason Brown
              Reviewer:
              Jason Brown
            • Votes:
              4
              Watchers:
              28
