Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-957

convenience workflow for replacing dead node

    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Tools
    • Labels:
      None

      Description

      Replacing a dead node with a new one is a common operation, but "nodetool removetoken" followed by bootstrap is inefficient (re-replicating data first to the remaining nodes, then to the new one) and manually bootstrapping to a token "just less than" the old one's, followed by "nodetool removetoken" is slightly painful and prone to manual errors.

      First question: how would you expose this in our tool ecosystem? It needs to be a startup-time option to the new node, so it can't be nodetool, and messing with the config xml definitely takes the "convenience" out. A one-off -DreplaceToken=XXY argument?

        Issue Links

          Activity

          Hide
          kingryan Ryan King added a comment -

          It would be easier if a node could start without joining the ring: CASSANDRA-526.

          Show
          kingryan Ryan King added a comment - It would be easier if a node could start without joining the ring: CASSANDRA-526 .
          Hide
          lenn0x Chris Goffinet added a comment -

          Found when testing using RF=2 that there is a chance that the local node could be included in the workMap, and be the first node ordered by proximity. workMap uses the first node from the list as a source. This would cause sending Bootstrap stream sessions to itself.

          Show
          lenn0x Chris Goffinet added a comment - Found when testing using RF=2 that there is a chance that the local node could be included in the workMap, and be the first node ordered by proximity. workMap uses the first node from the list as a source. This would cause sending Bootstrap stream sessions to itself.
          Hide
          nickmbailey Nick Bailey added a comment -

          What happens if I bring up a node with the same ip and bootstrap on but forget the replace option? It looks like it will try to bootstrap to an auto picked token. Am I reading that right?

          What happens if I accidentally give the wrong token with the replace option? If I accidentally give the token for a live node will it try to bootstrap to the same position?

          Show
          nickmbailey Nick Bailey added a comment - What happens if I bring up a node with the same ip and bootstrap on but forget the replace option? It looks like it will try to bootstrap to an auto picked token. Am I reading that right? What happens if I accidentally give the wrong token with the replace option? If I accidentally give the token for a live node will it try to bootstrap to the same position?
          Hide
          lenn0x Chris Goffinet added a comment -

          Good observation. I'll work on covering both use cases.

          Show
          lenn0x Chris Goffinet added a comment - Good observation. I'll work on covering both use cases.
          Hide
          jbellis Jonathan Ellis added a comment -

          Chris, are you still working on this?

          Show
          jbellis Jonathan Ellis added a comment - Chris, are you still working on this?
          Hide
          lenn0x Chris Goffinet added a comment -

          yes. ill post an update next week with an updated patchset.

          Show
          lenn0x Chris Goffinet added a comment - yes. ill post an update next week with an updated patchset.
          Hide
          lenn0x Chris Goffinet added a comment -

          I am going to defer this to anyone else who would like to pick up this ticket. I just do not have the spare time to focus on this.

          Show
          lenn0x Chris Goffinet added a comment - I am going to defer this to anyone else who would like to pick up this ticket. I just do not have the spare time to focus on this.
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Thanks Chris, I will give it a try...

          Show
          vijay2win@yahoo.com Vijay added a comment - Thanks Chris, I will give it a try...
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Adding support for replacing token.... This also supports replacement with the same IP. Reworked Hints to be based on Token instead of IP's.... The 3rd part of the patch also makes the hints be delivered to the host (currently seems like it is not delivered at all... Let me know if you want to move this to a different ticket).

          Show
          vijay2win@yahoo.com Vijay added a comment - Adding support for replacing token.... This also supports replacement with the same IP. Reworked Hints to be based on Token instead of IP's.... The 3rd part of the patch also makes the hints be delivered to the host (currently seems like it is not delivered at all... Let me know if you want to move this to a different ticket).
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Reported to 2496, Testing it took longer than expected time...

          Show
          vijay2win@yahoo.com Vijay added a comment - Reported to 2496, Testing it took longer than expected time...
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Seems like CASSANDRA-2928 fixes the hints issue... so we can ignore 0003 in this ticket.

          Show
          vijay2win@yahoo.com Vijay added a comment - Seems like CASSANDRA-2928 fixes the hints issue... so we can ignore 0003 in this ticket.
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Rebased.... with a better way to look for operator error.

          Show
          vijay2win@yahoo.com Vijay added a comment - Rebased.... with a better way to look for operator error.
          Hide
          nickmbailey Nick Bailey added a comment -

          So a few questions:

          • In Gossiper.doStatusCheck() you made it ignore any state that is for the local endpoint and is not a dead state. Shouldn't it just always ignore any state about the local endpoint though? Basically what it was doing previously?
          • Basically the same question about Gossiper.applyStateLocally() the loop continues if the state is for the local node and the state is dead. Why would we want to apply a live local state?
          • Does the hibernate state need the true/false value? Seems like all we care about is that it is set at all. Looks like we we are starting up right now we automatically go into a hibernate state, then we go into a bootstrap state afterwards if the specified a replace token. Seems like we shouldn't set a state at all until we know we are doing one of replace/bootstrap/just joining.
          • It looks like right now you could specify a replace token that isn't part of the cluster. If that happens we should throw an exception and tell the user to do the normal bootstrap process.
          • Why use the last gossip time to determine if the node we are replacing is alive? Why not just check gossip to see if the ring thinks it is alive?
          • We should update the the message for the exception that is thrown when you try to bootstrap to an existing token. It should indicate either remove the dead node or follow this replacement process.
          • I'm not sure why we are calling updateNormalToken() in the StorageService.bootstrap() method when it's a token replacement.
          • A little bit of doc on this would be good, maybe in cassandra.yaml? Just on how to pass the argument to the startup process.

          I also need to dive into the hint stuff a little bit more, I'm less familiar with that code.

          Show
          nickmbailey Nick Bailey added a comment - So a few questions: In Gossiper.doStatusCheck() you made it ignore any state that is for the local endpoint and is not a dead state. Shouldn't it just always ignore any state about the local endpoint though? Basically what it was doing previously? Basically the same question about Gossiper.applyStateLocally() the loop continues if the state is for the local node and the state is dead. Why would we want to apply a live local state? Does the hibernate state need the true/false value? Seems like all we care about is that it is set at all. Looks like we we are starting up right now we automatically go into a hibernate state, then we go into a bootstrap state afterwards if the specified a replace token. Seems like we shouldn't set a state at all until we know we are doing one of replace/bootstrap/just joining. It looks like right now you could specify a replace token that isn't part of the cluster. If that happens we should throw an exception and tell the user to do the normal bootstrap process. Why use the last gossip time to determine if the node we are replacing is alive? Why not just check gossip to see if the ring thinks it is alive? We should update the the message for the exception that is thrown when you try to bootstrap to an existing token. It should indicate either remove the dead node or follow this replacement process. I'm not sure why we are calling updateNormalToken() in the StorageService.bootstrap() method when it's a token replacement. A little bit of doc on this would be good, maybe in cassandra.yaml? Just on how to pass the argument to the startup process. I also need to dive into the hint stuff a little bit more, I'm less familiar with that code.
          Hide
          nickmbailey Nick Bailey added a comment -

          The hint rework looks good. The only comment I have there is that it would be nice if the logging statements for sending hints creating hints indicated the ip as well as the token. Even though it's stored by token it would be nice to immediately see the ip in the log without having to look it up.

          I'm also unsure about the reasoning behind the last patch. Why increase the initial sleep in joinTokenRing?

          Show
          nickmbailey Nick Bailey added a comment - The hint rework looks good. The only comment I have there is that it would be nice if the logging statements for sending hints creating hints indicated the ip as well as the token. Even though it's stored by token it would be nice to immediately see the ip in the log without having to look it up. I'm also unsure about the reasoning behind the last patch. Why increase the initial sleep in joinTokenRing?
          Hide
          vijay2win@yahoo.com Vijay added a comment -
          • In Gossiper.doStatusCheck() you made it ignore any state that is for the local endpoint and is not a dead state. Shouldn't it just always ignore any state about the local endpoint though? Basically what it was doing previously?
          • Basically the same question about Gossiper.applyStateLocally() the loop continues if the state is for the local node and the state is dead. Why would we want to apply a live local state?
            -> Fixed, initial intention was to find the old state of the node, Seems like it is not possible now…
          • Does the hibernate state need the true/false value? Seems like all we care about is that it is set at all. Looks like we we are starting up right now we automatically go into a hibernate state, then we go into a bootstrap state afterwards if the specified a replace token. Seems like we shouldn't set a state at all until we know we are doing one of replace/bootstrap/just joining.
            -> it will be either true or false (If not a replace, or overwrite with the state normal)… if you don't then Gossiper.applyStateLocally will mark it alive on all the other nodes.
          • It looks like right now you could specify a replace token that isn't part of the cluster. If that happens we should throw an exception and tell the user to do the normal bootstrap process.
            -> As we are ignoring the local states… this information is hard to gather when we are trying to replace the same node…. The check is to see no other live node owns this token….
            -> We can document in the wiki about the effects if they replace a token which is not part of the ring…. (repair/decommission)
          • Why use the last gossip time to determine if the node we are replacing is alive? Why not just check gossip to see if the ring thinks it is alive?
            -> because by default when we hear about someone we consider them to be alive…. the idea is to check and see if we heard from them back or not (After the ring delay) if not then there is more probability that the dead node is dead (Thats why we have to wait for 90 + delay
          • We should update the the message for the exception that is thrown when you try to bootstrap to an existing token. It should indicate either remove the dead node or follow this replacement process.
            -> I am not sure if i parse that, i have added more to it plz check.
          • I'm not sure why we are calling updateNormalToken() in the StorageService.bootstrap() method when it's a token replacement.
            -> Thats because you don't want the range request sent to the node which is not existing.
          • A little bit of doc on this would be good, maybe in cassandra.yaml? Just on how to pass the argument to the startup process.
            -> Yaml is bad because this is a one time thing…. Wiki page? like the don't join ring property
          Show
          vijay2win@yahoo.com Vijay added a comment - In Gossiper.doStatusCheck() you made it ignore any state that is for the local endpoint and is not a dead state. Shouldn't it just always ignore any state about the local endpoint though? Basically what it was doing previously? Basically the same question about Gossiper.applyStateLocally() the loop continues if the state is for the local node and the state is dead. Why would we want to apply a live local state? -> Fixed, initial intention was to find the old state of the node, Seems like it is not possible now… Does the hibernate state need the true/false value? Seems like all we care about is that it is set at all. Looks like we we are starting up right now we automatically go into a hibernate state, then we go into a bootstrap state afterwards if the specified a replace token. Seems like we shouldn't set a state at all until we know we are doing one of replace/bootstrap/just joining. -> it will be either true or false (If not a replace, or overwrite with the state normal)… if you don't then Gossiper.applyStateLocally will mark it alive on all the other nodes. It looks like right now you could specify a replace token that isn't part of the cluster. If that happens we should throw an exception and tell the user to do the normal bootstrap process. -> As we are ignoring the local states… this information is hard to gather when we are trying to replace the same node…. The check is to see no other live node owns this token…. -> We can document in the wiki about the effects if they replace a token which is not part of the ring…. (repair/decommission) Why use the last gossip time to determine if the node we are replacing is alive? Why not just check gossip to see if the ring thinks it is alive? -> because by default when we hear about someone we consider them to be alive…. the idea is to check and see if we heard from them back or not (After the ring delay) if not then there is more probability that the dead node is dead (Thats why we have to wait for 90 + delay We should update the the message for the exception that is thrown when you try to bootstrap to an existing token. It should indicate either remove the dead node or follow this replacement process. -> I am not sure if i parse that, i have added more to it plz check. I'm not sure why we are calling updateNormalToken() in the StorageService.bootstrap() method when it's a token replacement. -> Thats because you don't want the range request sent to the node which is not existing. A little bit of doc on this would be good, maybe in cassandra.yaml? Just on how to pass the argument to the startup process. -> Yaml is bad because this is a one time thing…. Wiki page? like the don't join ring property
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Attaching newer version with fix and rebase.

          Show
          vijay2win@yahoo.com Vijay added a comment - Attaching newer version with fix and rebase.
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          I'm also unsure about the reasoning behind the last patch. Why increase the initial sleep in joinTokenRing?
          – Ring delay + extra time so we can check if there is any live server before actually replacing the node.

          Show
          vijay2win@yahoo.com Vijay added a comment - I'm also unsure about the reasoning behind the last patch. Why increase the initial sleep in joinTokenRing? – Ring delay + extra time so we can check if there is any live server before actually replacing the node.
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Added more logging.... for RMV i am not sure if we have to parse the string to token and then to ip.

          Show
          vijay2win@yahoo.com Vijay added a comment - Added more logging.... for RMV i am not sure if we have to parse the string to token and then to ip.
          Hide
          nickmbailey Nick Bailey added a comment -

          Just a couple more things

          • Looks like some tests are failing with these patches applied. Want to look into that?
          • Can you revert the logging change in gossiper for when a node restarts? You changed it to, 'logger.info("Node {} has been restarted...", ep);' which isn't particularly more informative. The old logging statement allowed you to easily grep for 'UP' in the logs.
          • Can we only sleep for BROADCAST_INTERVAL if we are actually replacing a token? No reason to add extra sleeping to the normal bootstrap case.
          • Nitpick: you added some random whitespace throughout those 2 patches at the ends of lines. Can we remove that when you fix the above.

          Another thing, when we document this let's be sure to mention that clients should not connect to a node that is replacing another node until the process is complete. They will see inconsistent results if they do since gossiper always assumes the local node is alive.

          Show
          nickmbailey Nick Bailey added a comment - Just a couple more things Looks like some tests are failing with these patches applied. Want to look into that? Can you revert the logging change in gossiper for when a node restarts? You changed it to, 'logger.info("Node {} has been restarted...", ep);' which isn't particularly more informative. The old logging statement allowed you to easily grep for 'UP' in the logs. Can we only sleep for BROADCAST_INTERVAL if we are actually replacing a token? No reason to add extra sleeping to the normal bootstrap case. Nitpick: you added some random whitespace throughout those 2 patches at the ends of lines. Can we remove that when you fix the above. Another thing, when we document this let's be sure to mention that clients should not connect to a node that is replacing another node until the process is complete. They will see inconsistent results if they do since gossiper always assumes the local node is alive.
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          • Looks like some tests are failing with these patches applied. Want to look into that?
          Fixed somehow this got missed from v1 to v4...
          • Can you revert the logging change in gossiper for when a node restarts? You changed it to, 'logger.info("Node {} has been restarted...", ep);' which isn't particularly more informative. The old logging statement allowed you
          to easily grep for 'UP' in the logs.
          Done!
          • Can we only sleep for BROADCAST_INTERVAL if we are actually replacing a token? No reason to add extra sleeping to the normal bootstrap case.
          If we do that we will not be able to check the update interval because all the messages will be within the delay interval…. sleeping little more will make allow us to identify the node isAlive or not (if in case the replacing node with a different IP is alive) - Plz check EndpointState(HeartBeatState initialHbState)

          • Nitpick: you added some random whitespace throughout those 2 patches at the ends of lines. Can we remove that when you fix the above.
          Not sure where you see it.

          Show
          vijay2win@yahoo.com Vijay added a comment - • Looks like some tests are failing with these patches applied. Want to look into that? Fixed somehow this got missed from v1 to v4... • Can you revert the logging change in gossiper for when a node restarts? You changed it to, 'logger.info("Node {} has been restarted...", ep);' which isn't particularly more informative. The old logging statement allowed you to easily grep for 'UP' in the logs. Done! • Can we only sleep for BROADCAST_INTERVAL if we are actually replacing a token? No reason to add extra sleeping to the normal bootstrap case. If we do that we will not be able to check the update interval because all the messages will be within the delay interval…. sleeping little more will make allow us to identify the node isAlive or not (if in case the replacing node with a different IP is alive) - Plz check EndpointState(HeartBeatState initialHbState) • Nitpick: you added some random whitespace throughout those 2 patches at the ends of lines. Can we remove that when you fix the above. Not sure where you see it.
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Rebased with the new hints.

          Show
          vijay2win@yahoo.com Vijay added a comment - Rebased with the new hints.
          Hide
          nickmbailey Nick Bailey added a comment -

          Looks like an onRestart() method was added for the state subscribers. I think it was accidental but looks like your change removes that call and replaces it with markDead(). Was that accidental?

          Show
          nickmbailey Nick Bailey added a comment - Looks like an onRestart() method was added for the state subscribers. I think it was accidental but looks like your change removes that call and replaces it with markDead(). Was that accidental?
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Yeah sorry for the oversight on onRestart() but markAsDead() is intentional otherwise the if the node is coming with the same IP even though we dont markAsAlive we will see the traffic flowing though it as it is because of the gossip hence we make sure we dont do that until it is marked alive....

          Show
          vijay2win@yahoo.com Vijay added a comment - Yeah sorry for the oversight on onRestart() but markAsDead() is intentional otherwise the if the node is coming with the same IP even though we dont markAsAlive we will see the traffic flowing though it as it is because of the gossip hence we make sure we dont do that until it is marked alive....
          Hide
          nickmbailey Nick Bailey added a comment -

          +1 on 0001 v7 and 0002 v6

          Show
          nickmbailey Nick Bailey added a comment - +1 on 0001 v7 and 0002 v6
          Hide
          jbellis Jonathan Ellis added a comment -

          Can you add a short how-to to NEWS.txt describing this feature?

          Show
          jbellis Jonathan Ellis added a comment - Can you add a short how-to to NEWS.txt describing this feature?
          Hide
          vijay2win@yahoo.com Vijay added a comment -

          Done! + added some documentation @ http://wiki.apache.org/cassandra/Operations letme know if you want to move it to somewhere else. Thanks...

          Show
          vijay2win@yahoo.com Vijay added a comment - Done! + added some documentation @ http://wiki.apache.org/cassandra/Operations letme know if you want to move it to somewhere else. Thanks...
          Hide
          jbellis Jonathan Ellis added a comment -

          committed, thanks Vijay and Nick!

          Show
          jbellis Jonathan Ellis added a comment - committed, thanks Vijay and Nick!
          Hide
          hudson Hudson added a comment -

          Integrated in Cassandra #1076 (See https://builds.apache.org/job/Cassandra/1076/)
          convenience workflow for replacing dead node
          patch by Vijay; reviewed by Nick Bailey for CASSANDRA-957

          jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1165468
          Files :

          • /cassandra/trunk/NEWS.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/config/DatabaseDescriptor.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/HintedHandOffManager.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java
          • /cassandra/trunk/src/java/org/apache/cassandra/dht/BootStrapper.java
          • /cassandra/trunk/src/java/org/apache/cassandra/gms/EndpointState.java
          • /cassandra/trunk/src/java/org/apache/cassandra/gms/Gossiper.java
          • /cassandra/trunk/src/java/org/apache/cassandra/gms/VersionedValue.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/LoadBroadcaster.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/MigrationManager.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java
          • /cassandra/trunk/src/java/org/apache/cassandra/service/StorageService.java
          Show
          hudson Hudson added a comment - Integrated in Cassandra #1076 (See https://builds.apache.org/job/Cassandra/1076/ ) convenience workflow for replacing dead node patch by Vijay; reviewed by Nick Bailey for CASSANDRA-957 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1165468 Files : /cassandra/trunk/NEWS.txt /cassandra/trunk/src/java/org/apache/cassandra/config/DatabaseDescriptor.java /cassandra/trunk/src/java/org/apache/cassandra/db/HintedHandOffManager.java /cassandra/trunk/src/java/org/apache/cassandra/db/RowMutation.java /cassandra/trunk/src/java/org/apache/cassandra/dht/BootStrapper.java /cassandra/trunk/src/java/org/apache/cassandra/gms/EndpointState.java /cassandra/trunk/src/java/org/apache/cassandra/gms/Gossiper.java /cassandra/trunk/src/java/org/apache/cassandra/gms/VersionedValue.java /cassandra/trunk/src/java/org/apache/cassandra/service/LoadBroadcaster.java /cassandra/trunk/src/java/org/apache/cassandra/service/MigrationManager.java /cassandra/trunk/src/java/org/apache/cassandra/service/StorageProxy.java /cassandra/trunk/src/java/org/apache/cassandra/service/StorageService.java

            People

            • Assignee:
              vijay2win@yahoo.com Vijay
              Reporter:
              jbellis Jonathan Ellis
              Reviewer:
              Nick Bailey
            • Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 24h
                24h
                Remaining:
                Remaining Estimate - 24h
                24h
                Logged:
                Time Spent - Not Specified
                Not Specified

                  Development