Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

      Description

      Add a JMX command to tell a node to decommission itself (moving its data to the next node on the ring).

          Activity

          Hudson added a comment -

          Integrated in Cassandra #257 (See http://hudson.zones.apache.org/hudson/job/Cassandra/257/)
          tokenMetadata should be updated before removing obsolete pending ranges, and unset bootstrapped flag when decommission is complete. patch by Jaakko Laine and jbellis for

          Jonathan Ellis added a comment -

          committed 05 and bootstrap flag fix

          Jonathan Ellis added a comment -

           > When node is decommissioned, should it delete its saved token? If the same node is brought back online without manual data deletion, it will enter the ring without proper bootstrap.

          You're right. So we should delete its "bootstrapped" flag. Leaving the token is harmless and may be useful to someone.

          Hudson added a comment -

          Integrated in Cassandra #256 (See http://hudson.zones.apache.org/hudson/job/Cassandra/256/)
          simplify getChangedRangesForLeaving. patch by Jaakko Laine; reviewed by jbellis for
          add leaving mode
          patch by jbellis; reviewed by Jaakko Laine for
          move more generic streaming code into Streaming.java
          patch by jbellis; reviewed by Jaakko Laine for
          clean up transfer code from BMVH; move to Streaming.java
          patch by jbellis; reviewed by Jaakko Laine for

          Jaakko Laine added a comment -

           When node is decommissioned, should it delete its saved token? If the same node is brought back online without manual data deletion, it will enter the ring without proper bootstrap.

          Jaakko Laine added a comment -

          0005
          Patch name says it all. tokenMetadata should be updated before removing obsolete pending ranges

          Jonathan Ellis added a comment -

          committed w/ above changes

          Jonathan Ellis added a comment -

           1) The simplest thing is to leave the node in gossip, but remove it from TokenMetadata. This means gossip can tell nodes that were down temporarily about the new state, as designed, without nasty hacks. If/when someone actually needs to add/remove so many nodes that the clutter of internal Gossiper state becomes a problem, they can patch it.

           2) Added ARS.removeObsoletePendingRanges, which idempotently removes pending ranges that are no longer needed because the ring has updated to reflect the pending state change. This approach reduces the chance of accidentally removing something we shouldn't (rough sketch below).

          3) applied 04 patch
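
           A rough sketch of what an idempotent removal like that could look like, assuming the usual java.util / java.net imports; the method names and signatures here are illustrative stand-ins, not the committed code:

               // Hedged sketch only: a pending entry is obsolete once the ring already routes
               // the range to its target node, so running this repeatedly is harmless (idempotent).
               void removeObsoletePendingRanges(Map<Range, InetAddress> pendingRanges,
                                                AbstractReplicationStrategy strategy,
                                                TokenMetadata metadata)
               {
                   Iterator<Map.Entry<Range, InetAddress>> iter = pendingRanges.entrySet().iterator();
                   while (iter.hasNext())
                   {
                       Map.Entry<Range, InetAddress> entry = iter.next();
                       if (strategy.getNaturalEndpoints(entry.getKey(), metadata).contains(entry.getValue()))
                           iter.remove(); // ring has caught up; writes already go to this node
                   }
               }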

          Jonathan Ellis added a comment -

           1) LEAVING/NORMAL don't get re-broadcast though, so they remain part of the app state but won't ever be seen again by onChange. So as long as LEFT isn't part of the same gossip as LEAVING (and the sleep ensures it won't), in practice LEFT is already the last thing gossiped (and we shut down the gossiper right afterwards). This works as planned in my testing. I'm fine with patches improving this, but it's not strictly necessary.

           2) No, the pending ranges aren't there to say "I'm busy streaming" but "when the planned ring changes finish, this will be my range, so I need writes for that range in the meantime so I'm not out of date when the change-over happens." So the right place to remove those is in the STATE_LEFT change. I will add it there (see the sketch below).

          I will look at the 04 patch shortly.
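
           To make point (2) concrete, a minimal sketch of the STATE_LEFT side; the method names here are placeholders, not the actual onChange code:

               // Hedged sketch: on STATE_LEFT, update the ring first, then drop pending ranges
               // that the now-completed change has made unnecessary.
               void handleStateLeft(InetAddress endpoint)
               {
                   tokenMetadata.removeEndpoint(endpoint);    // ring now reflects the departure
                   removeObsoletePendingRanges();             // extra write targets no longer needed
               }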

          Jaakko Laine added a comment -

          I think there are two problems with this patchset:

           (1) After the leaving node gossips STATE_LEFT, its gossiper continues to broadcast application state, which still includes NORMAL (and LEAVING) from previous states. When other nodes get STATE_LEFT, they remove all information about this node, which will cause them to interpret the following gossip message as a new join. I don't know whether stopping the gossiper from broadcasting anything more after STATE_LEFT would cause other nodes to convict it. That would be the simplest solution, but we probably need to do something more creative to handle all possible state transitions.

           (2) When all data for a pending range has been streamed, the receiving node should announce the completion, otherwise pending ranges won't be removed. The simplest way would be to just gossip its token again, as this will cause pending ranges for this endpoint to be removed. The problem is that the receiving end does not know when the whole data transfer is complete, as tables are streamed one by one. Another possibility would be for the leaving node to gossip these endpoints as part of its STATE_LEFT message.

          Jaakko Laine added a comment -

          Patch 0004 modifies getChangedRangesForLeaving. IMHO this way is simpler and more effective, but if other people feel otherwise, this patch may as well be left out.

          Jonathan Ellis added a comment -

          rewrote, finished up step 4.

          Jonathan Ellis added a comment -

          Started implementing and I see what you mean now – we don't want to bother sending a range that a node already has because it's already a replica for that range.

          Jonathan Ellis added a comment -

          you're right, the code I posted only handles primary endpoints for the ranges the leaving node handles. (but we still don't need to make it as complicated as comparing sets of ranges/endpoints)

           Jaakko Laine added a comment - edited

          I think updateLeavingRanges should do:

          (1) get all ranges the leaving node is currently storing
          (2) get current natural endpoints for those ranges
          (3) get natural endpoints for these ranges when leaving node is removed
          (4) compare these two lists and add pending ranges for all nodes that are new in the lists (that is, taking responsibility for these ranges now that one node is leaving).

           I think we need to do this through the replication strategy, as simply looking at the token list cannot tell us what other constraints need to be satisfied. The replica list might change by more than one node if rack awareness (or other external considerations) is taken into account. (A rough sketch of these steps follows.)
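
           A rough sketch of those four steps, assuming hypothetical helper names (cloneWithoutEndpoint, getRangesFor) and the usual java.util / java.net imports; this is an illustration, not the actual API:

               // Hedged sketch of steps (1)-(4): go through the replication strategy so rack
               // awareness and other placement constraints are respected.
               Map<Range, Set<InetAddress>> pendingRangesForLeaving(InetAddress leaving,
                                                                    AbstractReplicationStrategy strategy,
                                                                    TokenMetadata metadata)
               {
                   Map<Range, Set<InetAddress>> pending = new HashMap<Range, Set<InetAddress>>();
                   TokenMetadata afterLeaving = metadata.cloneWithoutEndpoint(leaving);      // ring minus the leaving node
                   for (Range range : strategy.getRangesFor(leaving, metadata))              // (1) ranges it currently stores
                   {
                       Set<InetAddress> current = new HashSet<InetAddress>(strategy.getNaturalEndpoints(range, metadata));      // (2)
                       Set<InetAddress> future  = new HashSet<InetAddress>(strategy.getNaturalEndpoints(range, afterLeaving));  // (3)
                       future.removeAll(current);                                            // (4) keep only the nodes that are new
                       if (!future.isEmpty())
                           pending.put(range, future);
                   }
                   return pending;
               }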

          Jonathan Ellis added a comment -

          these patches get us up to but not including the "gossip STATE_LEFT and other nodes remove it" part.

          Jonathan Ellis added a comment -

          patch to add updateLeavingRanges

          Jonathan Ellis added a comment -

          we want the separate state so we can support removing nodes from the ring entirely, as well as load balancing. for some reason people keep asking for this and it's easy enough to support given that we're doing LB anyway.

          Jaakko Laine added a comment -

           I think that should work. I suppose STATE_LEFT will cause the token to be removed from the token ring? Is STATE_LEFT necessary, or can the node directly start to bootstrap to the new location after it has gossiped LEAVING and streamed all data?

          Jonathan Ellis added a comment -

           With 525 done, I think the rest looks like this (roughly sketched in code after the list):

          • unbootstrapping node gossips STATE_LEAVING: token
           • other nodes use that to set pending ranges on the nodes that will be responsible for the replica ranges it has, similar to UpdateBootstrapRanges (but you can't just call strategy.getPendingRanges since we're adding ranges to existing nodes rather than a new one – can we just use strategy.getAddressRanges and figure out which node will get each Range?)
          • stream over the data
           • gossip STATE_LEFT. other nodes remove it from the gossip network (this automatically makes new node ranges look like what they would have been w/ pendingranges merged). probably worth looking at the old MembershipCleaner code that was r/m'd in r828130.
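
           Roughly, the leaving node's half of that sequence might look like the sketch below; the gossip/stream calls and RING_DELAY are placeholders, not the actual StorageService API:

               // Hedged outline of the decommission flow described in the list above.
               void decommission() throws InterruptedException
               {
                   gossipState(STATE_LEAVING, myToken);     // announce intent; others set pending ranges
                   Thread.sleep(RING_DELAY);                // let the announcement propagate
                   streamRangesToNewReplicas();             // hand off data for the ranges being given up
                   gossipState(STATE_LEFT, myToken);        // announce departure; others drop this node
                   Thread.sleep(RING_DELAY);                // make sure LEFT goes out in its own gossip round
                   shutdownGossiper();                      // leave the ring
               }
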
          Jonathan Ellis added a comment -

          Here is what we need to do.

           TokenMetadata currently uses a slightly fragile combination of a Set of nodes known to be bootstrapping, and a Map of their tokens -> InetAddress, to determine the ranges bootstrapping nodes are concerned with (for CASSANDRA-497).

           We need to change that to instead have a Map of Range -> InetAddress, representing "these are ranges that the given node doesn't own yet, but will, so send updates in that range there as well as to its current destinations."

          That allows us to use the same structure for bootstrap (new node X gets these ranges, where before it had none) and unbootstrap (existing node Y gets X's ranges, as well as its existing ones).
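
           As a rough, self-contained sketch of that structure (the class and method names below are illustrative, not the actual TokenMetadata code):

               import java.net.InetAddress;
               import java.util.*;

               // Minimal stand-ins for the dht types, just so the sketch compiles on its own.
               interface Token extends Comparable<Token> { }
               interface Range { boolean contains(Token t); }

               // "Ranges that the given node doesn't own yet, but will."
               class PendingRangeMap
               {
                   private final Map<Range, InetAddress> pending = new HashMap<Range, InetAddress>();

                   public void addPending(Range range, InetAddress futureOwner)
                   {
                       pending.put(range, futureOwner);
                   }

                   // Writes whose token falls inside a pending range also go to the future owner,
                   // so it is not out of date when the ring change completes.
                   public Set<InetAddress> pendingEndpointsFor(Token token)
                   {
                       Set<InetAddress> targets = new HashSet<InetAddress>();
                       for (Map.Entry<Range, InetAddress> entry : pending.entrySet())
                           if (entry.getKey().contains(token))
                               targets.add(entry.getValue());
                       return targets;
                   }
               }

           The same map works for bootstrap (a new node gets ranges where before it had none) and unbootstrap (an existing node picks up the leaving node's ranges in addition to its own).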

          Then actually implementing unbootstrap is just wiring up the streaming from one node to another. See comments in the header of BootStrapper.java for the different moving parts involved in bootstrap; the process is basically the same. (Even to requiring anticompaction to split out the different replica ranges.)

          Sandeep Tata added a comment -

          I think it is exactly as complicated as bootstrap.

          Jonathan Ellis added a comment -

          note that this is actually simpler than bootstrap, since we don't have to do anticompaction first – just move all data, and we're done

          the tricky part is not serving invalid data while this happens. see CASSANDRA-397 – I don't think piling special cases into the token ring is the way to go, but Sandeep may disagree.


            People

            • Assignee:
              Jonathan Ellis
            • Reporter:
              Jonathan Ellis
            • Votes:
              0
            • Watchers:
              2

