CASSANDRA-5916

gossip and tokenMetadata get hostId out of sync on failed replace_node with the same IP address

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.11, 2.0.2
    • Component/s: None
    • Labels: None

      Description

      If you try to replace_node an existing, live hostId, it will error out. However, if you're using an existing IP to do this (as in, you chose the wrong uuid to replace by accident), then the newly generated hostId wipes out the old one in TMD (tokenMetadata), and when you do try to replace it, replace_node will complain that it does not exist. Examination of gossipinfo still shows the old hostId, but now you can't replace it either.
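
      A minimal, hypothetical illustration of the wipe-out (toy code, not the actual TokenMetadata implementation): because the endpoint-to-hostId bookkeeping is keyed by address, registering the freshly generated hostId for the same IP silently drops the old mapping that a later replace_node would need.

          import java.net.InetAddress;
          import java.util.HashMap;
          import java.util.Map;
          import java.util.UUID;

          // Toy model of the endpoint -> hostId bookkeeping; names are illustrative,
          // not the real TokenMetadata code.
          public class HostIdWipeSketch
          {
              private static final Map<InetAddress, UUID> hostIdByEndpoint = new HashMap<>();

              static void updateHostId(UUID hostId, InetAddress endpoint)
              {
                  // Keyed by endpoint: a second registration for the same IP
                  // overwrites the previous hostId.
                  hostIdByEndpoint.put(endpoint, hostId);
              }

              public static void main(String[] args) throws Exception
              {
                  InetAddress node = InetAddress.getByName("127.0.0.3");
                  UUID original = UUID.randomUUID();
                  updateHostId(original, node);

                  // The mistaken replacement attempt from the same IP generates and
                  // registers a fresh hostId, wiping out the original in TMD...
                  updateHostId(UUID.randomUUID(), node);

                  // ...so a later replace_node against the original hostId finds nothing,
                  // even though gossip still advertises it.
                  System.out.println(hostIdByEndpoint.containsValue(original)); // false
              }
          }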

      Attachments

      1. 5916-v4.txt
        21 kB
        Brandon Williams
      2. 5916-v3.txt
        20 kB
        Brandon Williams
      3. 5916-v2.txt
        19 kB
        Brandon Williams
      4. 5916.txt
        17 kB
        Brandon Williams


          Activity

          Brandon Williams added a comment -

          The problem runs a little deeper, too: even if you specify the right uuid and the replace then fails for whatever reason, they're out of sync again and you can't do the replace at all.

          Brandon Williams added a comment -

          This same behavior also occurs with replace_token.

          Brandon Williams added a comment - edited

          This isn't so much a problem with retrying the replace as it is with using the same IP address (which currently won't work at all). The reason for this is that by using the same IP address, the replacing node itself changes the HOST_ID, and then can't find the old one. It's not as simple as just not advertising a new HOST_ID either, since by not having one but modifying STATUS we wipe out any existing HOST_ID as well.

          Brandon Williams added a comment -

          Here's my first (working) attempt at solving this. This patch disables replace_[token,node] and adds a new replace_address. In some ways replace_address seems more intuitive, but really we have to do it this way because we're going to pull everything else we need out of gossip, and endpoints are keyed by address.

          We use a special gossip operation I'm calling 'shadow gossip', where we use a generation of zero and only do a single half-round. This means we send an empty SYN with our own blank digest to a seed, accept one ACK, and then stop the gossip round there, so as not to perturb any existing state.

          From there we extract the original HOST_ID and tokens, and use those for the replacement process. A catch here, though, is that once our gossiper actually starts, we'll knock both the TOKENS state and the existing STATUS state (for single-token replacements) out with our newer, real generation, so if the replace fails past this point, we can't retry. It may be possible to stay in shadow gossip mode through all of the process to get around that (and just remove the hibernate state), but I haven't tried this.
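
          A rough, self-contained sketch of the half-round idea (types and names are invented for illustration; this is not the patch itself):

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;
          import java.util.UUID;

          // Toy model of the shadow gossip half-round: SYN with a blank digest for
          // ourselves, take the ACK, send no ACK2. Not the real Gossiper classes.
          public class ShadowGossipSketch
          {
              record EndpointState(UUID hostId, List<Long> tokens) {}

              // What the seed already knows about the cluster.
              static final Map<String, EndpointState> seedState = new HashMap<>();

              // The seed's ACK only covers endpoints named in the SYN digest list,
              // so a digest for our own address (generation 0) gets the old node's
              // state back -- but only when we reuse the same IP.
              static Map<String, EndpointState> shadowRound(String ourAddress)
              {
                  Map<String, EndpointState> ack = new HashMap<>();
                  if (seedState.containsKey(ourAddress))
                      ack.put(ourAddress, seedState.get(ourAddress));
                  return ack;   // stop here: no ACK2, nothing on the seed is perturbed
              }

              public static void main(String[] args)
              {
                  seedState.put("127.0.0.3",
                                new EndpointState(UUID.randomUUID(), List.of(6564349027099416762L)));

                  // Replacing 127.0.0.3 with the same IP: the ACK carries the old
                  // HOST_ID and tokens, which the replacement process then reuses.
                  System.out.println(shadowRound("127.0.0.3"));
              }
          }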

          Ravi Prasad added a comment -

          Tested the patch applied against 1.2.10 and it works. Hints replay also works now after replace/bootstrap. Regarding the corner case where the replace fails to finish after the gossiper has started with its new generation, hence knocking out the TOKENS state: does it make sense to allow the operator to specify replace_token with the token(s) along with the replace_address to recover from such a scenario? The token list is logged during the first attempt already.
          I think remaining in shadow mode may not work optimally for cases where the node being replaced was down for more than the hint window. So, all the nodes would have stopped hinting, and after the replace, repair would need to be run to get the new data fed in during the replace.

          Brandon Williams added a comment -

          First, thanks for testing, Ravi Prasad!

          does it make sense to allow the operator to specify replace_token with the token(s) along with the replace_address to recover

          That could work, but I find it a bit ugly and confusing, especially since replace_token alone is supposed to work right now, but does not.

          I think remaining in shadow mode may not work optimally for cases where the node being replaced was down for more than the hint window. So, all the nodes would have stopped hinting, and after the replace, repair would need to be run to get the new data fed in during the replace.

          That is true regardless of shadow mode though, since hibernate is a dead state and the node doesn't go live to reset the hint timer until the replace has completed.

          Ravi Prasad added a comment - edited

          That is true regardless of shadow mode though, since hibernate is a dead state and the node doesn't go live to reset the hint timer until the replace has completed.

          My understanding is that, due to the generation change of the replacing node, gossiper.handleMajorStateChange marks the node as dead, as hibernate is one of the DEAD_STATES. So, the other nodes mark the replacing node as dead before the token bootstrap starts, and hence should be storing hints for the replacing node from that point. Am I reading it wrong?

          Brandon Williams added a comment -

          You're right, it will change the endpoint's expire time and reset the window. That said, once the bootstrap has started the node should be receiving any incoming writes for the range it owns, so 'new' hints shouldn't matter in the common case where it succeeds.

          Ravi Prasad added a comment -

          once the bootstrap has started the node should be receiving any incoming writes for the range it owns, so 'new' hints shouldn't matter in the common case where it succeeds.

          Is this true for a node bootstrapping in the hibernate state? From what I have observed, writes to a hibernating node during its bootstrap are not sent to it, as gossip marks that node down, right?

          Brandon Williams added a comment -

          It's not true for replacing, not only because we're down but also because we don't do any pending range announcement since there's no point.

          I'd be fine with telling people they need to have a large enough hint window to complete the replace to avoid needing to repair, but we have to spin up 'real' gossip to get the schema anyway, so staying in shadow mode the entire time won't work.

          However, there is a relatively simple way to have our cake (automatically extended hint window) and eat it too (be able to retry on failure without having to specify anything new). As soon as we receive the tokens via shadow gossip, we can set them ourselves along with the hibernate state. When we spin up the full gossip mode to get the schema, we'll be using the same HOST_ID and TOKENS that we grabbed, so if anything goes wrong at that point we can just grab them again next time.

          This just leaves the issue of checking that the host is really dead, but this doesn't make any sense when replacing with the same IP anyway, so we can skip it when the addresses match.

          v2 does all of this and includes a few other minor cleanups.
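
          Roughly, the v2 flow can be sketched like this (illustrative names only; the stubs stand in for the real StorageService/Gossiper calls, not the actual patch):

          import java.net.InetAddress;
          import java.util.Collection;
          import java.util.List;
          import java.util.UUID;

          // Illustrative outline of the v2 replacement flow; method names are
          // invented stand-ins, not the actual StorageService code.
          public class ReplaceRetrySketch
          {
              static void prepareReplacement(InetAddress replaceAddress,
                                             InetAddress broadcastAddress,
                                             UUID oldHostId,
                                             Collection<Long> oldTokens)
              {
                  // Checking that the node being replaced is really dead makes no
                  // sense when we *are* that address, so skip it when they match.
                  if (!replaceAddress.equals(broadcastAddress))
                      assertEndpointIsDown(replaceAddress);

                  // Adopt the HOST_ID and tokens learned via shadow gossip and
                  // announce hibernate. Because full gossip then advertises the same
                  // values, a failed replace can simply be retried: the next shadow
                  // round sees them again.
                  adoptHostId(oldHostId);
                  setTokens(oldTokens);
                  announceHibernate();
              }

              // Hypothetical helpers, stubbed out for illustration.
              static void assertEndpointIsDown(InetAddress ep) {}
              static void adoptHostId(UUID id) {}
              static void setTokens(Collection<Long> tokens) {}
              static void announceHibernate() {}

              public static void main(String[] args) throws Exception
              {
                  prepareReplacement(InetAddress.getByName("127.0.0.3"),
                                     InetAddress.getByName("127.0.0.3"),
                                     UUID.randomUUID(),
                                     List.of(6564349027099416762L));
              }
          }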

          Tyler Hobbs added a comment -

          I'm testing this out with a three-node ccm cluster. If I do the following:

          1. (optional) stop node3
          2. add a blank node4
          3. start node4 with replace_address=127.0.0.3

          I'll get the following:

          ERROR 16:29:02,689 Exception encountered during startup
          java.lang.RuntimeException: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
              at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:421)
              at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:623)
              at org.apache.cassandra.service.StorageService.initServer(StorageService.java:604)
              at org.apache.cassandra.service.StorageService.initServer(StorageService.java:501)
              at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
              at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
              at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
          java.lang.RuntimeException: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
              at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:421)
              at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:623)
              at org.apache.cassandra.service.StorageService.initServer(StorageService.java:604)
              at org.apache.cassandra.service.StorageService.initServer(StorageService.java:501)
              at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
              at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
              at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
          Exception encountered during startup: Cannot replace_address /127.0.0.3because it doesn't exist in gossip
          ERROR 16:29:02,692 Exception in thread Thread[StorageServiceShutdownHook,5,main]
          java.lang.NullPointerException
              at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
              at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
              at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
              at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:569)
              at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
              at java.lang.Thread.run(Thread.java:724)
          

          This happens whether node3 is up or down. It seems like this problem occurs any time replace_address doesn't match the broadcast address.

          Brandon Williams added a comment -

          There are two distinct cases here: we replace 'ourself' with the same IP, or we replace a dead node with a new IP (a la EC2). We can't know which one we're doing a priori, so we shadow gossip. If we're replacing the same IP, our shadow SYN will contain it, and the remote node will ACK with what we need.

          If we're not replacing with the same IP, there's a problem: an ACK will only contain what was present in the SYN digest list. One could argue this is the sender being naive, since it obviously knows the node that sent the SYN doesn't have some states that it does, but I think at scale this makes sense since it's possible a third node has begun gossiping with the SYN sender, too. In any case, I don't want to change that behavior at this point.

          The other problem is, we can't just sit around and wait for someone to send us a populated SYN either, since we're not a part of gossip and we're new. But we don't know we're new yet, and can't insert ourselves into gossip either, or we'll break the case of using the same IP.

          So, we'll create a special case for shadow gossip, and redefine it a bit. Instead of sending a SYN with our own endpoint and a generation of zero, we'll send a completely empty SYN (digest-wise; we still populate the cluster name and partitioner, since those checks still make sense). This won't ever normally occur in gossip, because a node always knows about and adds itself. When we see an empty SYN, we know that the node that sent it is asking for everything we've got, and we can ACK with just that, allowing the replacement node to have whatever it needs for either the same or different IP cases.

          v3 does this.
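
          A small, self-contained sketch of the empty-SYN rule (toy types; not the real GossipDigestSynVerbHandler):

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;
          import java.util.UUID;

          // Toy model of the v3 special case: an empty SYN digest list means
          // "send me everything you have".
          public class EmptySynSketch
          {
              record EndpointState(UUID hostId, List<Long> tokens) {}

              static Map<String, EndpointState> handleSyn(List<String> synDigests,
                                                          Map<String, EndpointState> localState)
              {
                  // Only a shadow round ever produces an empty digest list, since a
                  // normal node always includes itself -- so ACK with the full state.
                  if (synDigests.isEmpty())
                      return new HashMap<>(localState);

                  // Normal rule: the ACK only covers endpoints named in the SYN.
                  Map<String, EndpointState> ack = new HashMap<>();
                  for (String endpoint : synDigests)
                      if (localState.containsKey(endpoint))
                          ack.put(endpoint, localState.get(endpoint));
                  return ack;
              }

              public static void main(String[] args)
              {
                  Map<String, EndpointState> seedState = Map.of(
                      "127.0.0.3", new EndpointState(UUID.randomUUID(), List.of(6564349027099416762L)));

                  // Replacing with a *different* IP: only the empty SYN recovers the
                  // dead node's state.
                  System.out.println(handleSyn(List.of("127.0.0.4"), seedState)); // {}
                  System.out.println(handleSyn(List.of(), seedState));            // full state
              }
          }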

          Tyler Hobbs added a comment -

          That strategy sounds good to me in principle.

          I'm seeing a few problems when testing, though.

          If I start node4 with replace_address=node3 (while node3 is either up or down), I get an NPE:

          DEBUG 14:01:33,359 Node /127.0.0.4 state normal, token [6564349027099416762]
           INFO 14:01:33,362 Node /127.0.0.4 state jump to normal
          ERROR 14:01:33,363 Exception encountered during startup
          java.lang.NullPointerException
          	at org.apache.cassandra.gms.Gossiper.usesHostId(Gossiper.java:682)
          	at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:694)
          	at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1382)
          	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1250)
          	at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:973)
          	at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1187)
          	at org.apache.cassandra.service.StorageService.setTokens(StorageService.java:214)
          	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:824)
          	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
          	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
          	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
          	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
          	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
          java.lang.NullPointerException
          	at org.apache.cassandra.gms.Gossiper.usesHostId(Gossiper.java:682)
          	at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:694)
          	at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1382)
          	at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1250)
          	at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:973)
          	at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1187)
          	at org.apache.cassandra.service.StorageService.setTokens(StorageService.java:214)
          	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:824)
          	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
          	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
          	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
          	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
          	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
          Exception encountered during startup: null
          ERROR 14:01:33,368 Exception in thread Thread[StorageServiceShutdownHook,5,main]
          java.lang.NullPointerException
          	at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
          	at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
          	at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
          	at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:549)
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
          	at java.lang.Thread.run(Thread.java:724)
          

          If I do replace_address with a non-existent node, after the ring delay sleep, I'll see:

          java.lang.RuntimeException: Unable to gossip with any seeds
          

          which is misleading, as that's not the actual problem. Perhaps we should explicitly check for presence of the address to replace?

          I've also seen that the node to replace can be the seed selected to gossip with, which results in this:

           INFO 14:12:58,298 Gathering node replacement information for /127.0.0.3
           INFO 14:12:58,302 Starting Messaging Service on port 7000
          DEBUG 14:12:58,316 attempting to connect to /127.0.0.3
          ERROR 14:13:29,320 Exception encountered during startup
          java.lang.RuntimeException: Unable to gossip with any seeds
          	at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1123)
          	at org.apache.cassandra.service.StorageService.prepareReplacementInfo(StorageService.java:396)
          	at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:603)
          	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:584)
          	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:481)
          	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
          	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
          	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
          
          Brandon Williams added a comment - edited

          v4 fixes the NPE and throws when autobootstrap is disabled. The second issue wasn't because of the replace_address, but because of checks in sendGossip; v4 just manually sends the message to all seeds. Depending on how many seeds you had, that may also fix the last issue (if the node being replaced is the only seed, obviously that can't work).
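
          The seed-handling part of that change is roughly this (hypothetical names, not the committed code):

          import java.util.Set;

          // Toy sketch of the v4 idea: bypass sendGossip()'s checks and push the
          // shadow SYN to every configured seed directly.
          public class ShadowSynBroadcastSketch
          {
              static void sendShadowSyn(Object synMessage, Set<String> seeds)
              {
                  for (String seed : seeds)
                      sendOneWay(synMessage, seed);   // hypothetical transport call
              }

              static void sendOneWay(Object message, String to)
              {
                  System.out.println("SYN -> " + to);
              }

              public static void main(String[] args)
              {
                  sendShadowSyn(new Object(), Set.of("127.0.0.1", "127.0.0.2"));
              }
          }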

          Tyler Hobbs added a comment -

          Minor nitpick: you're missing a space before "because" in:

          throw new RuntimeException("Cannot replace_address " + DatabaseDescriptor.getReplaceAddress() + "because it doesn't exist in gossip");
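
          Presumably the intended line just adds the space, i.e. something like:

          throw new RuntimeException("Cannot replace_address " + DatabaseDescriptor.getReplaceAddress() + " because it doesn't exist in gossip");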
          

          Other than that, +1

          Brandon Williams added a comment -

          Committed. I will note for ops folks that you can use replace_address in a mixed minor-version 1.2 cluster, as long as at least one seed is also upgraded. If no seeds are upgraded there will be no harm; the replace will simply fail.


            People

            • Assignee: Brandon Williams
            • Reporter: Brandon Williams
            • Reviewer: Tyler Hobbs
            • Votes: 0
            • Watchers: 6
