Details

      Description

      My reading (a while ago) of the code indicates that there is no logic involved during bootstrapping that avoids consistency level violations. If I recall correctly, it just grabs neighbors that are currently up.

      There are at least two issues I have with this behavior:

      • If I have a cluster with applications relying on QUORUM at RF=3, and bootstrapping completes based on only one node, I have just violated the supposedly guaranteed consistency semantics of the cluster.
      • Nodes can flap up and down at any time, so even if a human takes care to look at which nodes are up and thinks about it carefully before bootstrapping, there's no guarantee.

      A complication is that whether this is an issue depends on the use case (if everything you ever do is at CL.ONE, it's fine); even in a cluster which is otherwise used for QUORUM operations, you may wish to accept fewer-than-quorum nodes during bootstrap in various emergency situations.

      A potential easy fix is to have bootstrap take an argument specifying the number of hosts to bootstrap from, or to assume QUORUM if none is given.
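
      To make that proposal concrete, here is a minimal sketch in Java of such a pre-bootstrap check. The names are illustrative and this is not actual Cassandra code; the only existing API assumed is the failure detector's liveness check.

          // Hypothetical pre-bootstrap check, not actual Cassandra code.
          // requiredSources == null means "assume QUORUM", i.e. RF/2 + 1.
          public static void validateSources(Range range,
                                             Collection<InetAddress> replicas,
                                             Integer requiredSources,
                                             int replicationFactor)
          {
              int needed = (requiredSources != null) ? requiredSources : replicationFactor / 2 + 1;
              int alive = 0;
              for (InetAddress replica : replicas)
                  if (FailureDetector.instance.isAlive(replica))
                      alive++;
              if (alive < needed)
                  throw new IllegalStateException(String.format(
                      "Only %d live source(s) for %s, but %d required to preserve consistency",
                      alive, range, needed));
          }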

      (A related concern is bootstrapping across data centers. You may want to bootstrap from a local node and then run a repair to avoid sending loads of data across DCs while still achieving consistency. Or even if you don't care about the consistency issues, I don't think there is currently a way to bootstrap from local nodes only.)

      Thoughts?

      1. 2434-3.patch.txt
        14 kB
        paul cannon
      2. 2434-testery.patch.txt
        2 kB
        paul cannon

          Activity

          jbellis Jonathan Ellis added a comment -

          ISTM the easiest fix is to always stream from the node that will be removed from the replicas for each range, unless given permission from the operator to choose a replica that is closer / less dead.

          jbellis Jonathan Ellis added a comment -

          Related: CASSANDRA-833

          thepaul paul cannon added a comment -

          So, it looks like it will be possible for the node-that-will-be-removed to change between starting a bootstrap and finishing it (other nodes being bootstrapped/moved/decom'd during that time period); in some cases, that could still lead to a consistency violation. Is that unlikely enough that we don't care, here? At least the situation would be better with the proposed fix than it is now.

          Second question: what might the "permission from the operator to choose a replica that is closer/less dead" look like? Maybe just a boolean flag saying "it's ok to stream from any node for any range you need to stream"? Or would we want to allow specifying precise source nodes for any/all affected address ranges?

          jbellis Jonathan Ellis added a comment -

          it looks like it will be possible for the node-that-will-be-removed to change between starting a bootstrap and finishing it

          It's always been unsupported to bootstrap a second node into the same "token arc" while a previous one is ongoing. Does that cover what you're thinking of or are we still on the hook?

          what might the "permission from the operator to choose a replica that is closer/less dead" look like?

          It seems to me that the two valid choices are

          • Stream from "correct" replica
          • Stream from closest replica

          I can't think of a reason to stream from an arbitrary replica other than those options.

          thepaul paul cannon added a comment -

          It's always been unsupported to bootstrap a second node into the same "token arc" while a previous one is ongoing. Does that cover what you're thinking of or are we still on the hook?

          Is it also unsupported to decom within X's token arc, or move into/out of that arc, while X is bootstrapping? I think we're safe if so.

          jbellis Jonathan Ellis added a comment -

          Yes.

          thepaul paul cannon added a comment -

          This still needs some testing, but I'm putting it up now in case anyone has some time to take a look and make sure my approach is sane.

          Adds an optional "-n" argument to "nodetool join" to allow bootstrapping from closest live node (n == non-strict). Also recognizes an optional property "cassandra.join_strict" which can be set to false when a bootstrap is triggered by cassandra.join_ring.
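
          (Illustration only, not the code from the patch: the system property described above could be consumed with nothing more than the standard Java property API, for example:)

              // Hypothetical reading of the property; defaults to strict, consistency-safe behavior.
              boolean strict = Boolean.parseBoolean(System.getProperty("cassandra.join_strict", "true"));
              // strict == true  -> stream each range only from the replica being displaced
              // strict == false -> permitted to fall back to the closest live replica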

          nickmbailey Nick Bailey added a comment -

          Seems to me like the option to stream from the closest replica might just add more confusion without really gaining anything. The node that is leaving the replica set will never be in another datacenter. It could be on a different rack, but if you are following best practices and alternating racks, then it is likely either on the same rack, or there is only one copy on that rack and the best case possible is streaming from another rack anyway.

          thepaul paul cannon added a comment -

          Judging by the irc channel and user list, assuming people will follow best practices seems a bit of a dead end. Plus, what about the case where the node leaving the replica set is dead? You still want the option to allow choosing another to stream from. And we probably shouldn't default to choosing another without explicit permission, because of the consistency violation stuff.

          jbellis Jonathan Ellis added a comment -

          The node that is leaving the replica set will never be in another datacenter

          I think that's only strictly true for NTS, but I'm fine leaving it out. Not worth adding complexity for ONTS at this point.

          jbellis Jonathan Ellis added a comment -

          what about the case where the node leaving the replica set is dead

          Good point. We do need something to make that possible.

          nickmbailey Nick Bailey added a comment -

          Yeah I was assuming ONTS is basically deprecated at this point. Didn't think about the dead case though. I suppose just a 'force' type of option and a warning indicating the possible consistency issues works.

          jbellis Jonathan Ellis added a comment -

          As long as we need to handle the dead case I don't see any harm in having a slightly more generally-useful "use closest" option instead of "force to pick random live replica" option.

          nickmbailey Nick Bailey added a comment -

          Well, I imagine the 'force' option would pick the nearest live node. By 'force' I mean the option should be posed to the user as "We can't guarantee consistency in your cluster after this bootstrap since a node is down; if you would like to do this anyway, specify option X". Just saying you can either bootstrap or bootstrap from the closest node doesn't convey the implications as well, I think.

          Maybe we are on the same page and arguing over wording though.

          nickmbailey Nick Bailey added a comment -

          Just as initial feedback, I'm not sure we need a new getRangesWithSources method, especially with so much duplication between them. Seems like strict could be passed to the current method. Also, what about leaving getRangesWithSources how it is and passing strict to getWorkMap? That method can do the endpoint set math if it needs to and throw a more informative exception in the case that strict is set and the endpoint we want to fetch from is dead.

          thepaul paul cannon added a comment -

          I did that (passing strict to getWorkMap) at first, but it wasn't too clean since it required adding a 'table' argument as well as 'strict', and it ended up replacing too much of the getRangesWithSources functionality. So then I added 'strict' as a parameter to getRangesWithSources (actually, it looks like I neglected to update its javadoc comment), but the differences between getRangesWithSources and getRangesWithStrictSource are such that a combined method feels a lot more special-casey and clunky. I like this way best in the end.
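
          (For readers following along, a minimal sketch of the "strict source" idea under discussion, with simplified signatures and hypothetical helpers rather than the code from the attached patch: for each range the joining node will own, stream from the one replica that is pushed out of the replica set when the new token is inserted.)

              // Sketch only; currentReplicas/futureReplicas/localRanges are hypothetical helpers.
              Multimap<Range, InetAddress> getRangesWithStrictSource(Token newToken, TokenMetadata metadata)
              {
                  Multimap<Range, InetAddress> rangesToFetch = HashMultimap.create();

                  // Pretend the new token has already joined the ring.
                  TokenMetadata future = metadata.cloneOnlyTokenMap();
                  future.updateNormalToken(newToken, FBUtilities.getLocalAddress());

                  for (Range range : localRanges(future, FBUtilities.getLocalAddress()))
                  {
                      // Today's replicas minus tomorrow's replicas == the endpoint being displaced.
                      Set<InetAddress> displaced = new HashSet<InetAddress>(currentReplicas(range, metadata));
                      displaced.removeAll(futureReplicas(range, future));
                      for (InetAddress source : displaced)
                          rangesToFetch.put(range, source);
                  }
                  return rangesToFetch;
              }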

          nickmbailey Nick Bailey added a comment -

          Well I guess it kind of depends on which approach we take as well. Is the option A) bootstrap from the right token or bootstrap from the closest token, or B) bootstrap from the right token, but if that one isn't up, bootstrap from any other token, preferring the closer ones?

          Like I said, I'd say B, but if you and Jonathan both disagree.

          thepaul paul cannon added a comment -

          Yeah. B is probably easier on everyone, but I would say we simply can't do anything that might violate the consistency guarantee without explicit permission from the user.

          nickmbailey Nick Bailey added a comment -

          Ok, so if we always prefer to bootstrap from the correct token, then I still think we should combine getRangesWithStrictSource and getRangesWithSources. Basically the logic should be, find the 'best' node to stream from. If the user requested it, also find a list of other candidates and order them by proximity. Right?

          jbellis Jonathan Ellis added a comment -

          I'm okay with either A or B.

          I would say we simply can't do anything that might violate the consistency guarantee without explicit permission from the user

          I'm not sure I understand, are you saying that B would violate this, or just that the status quo does?

          hanzhu Zhu Han added a comment -

          Is it possible to make the node not reply to any request until bootstrap and anti-entropy repair have finished?

          This could fix the consistency problem introduced by bootstrap.

          jbellis Jonathan Ellis added a comment -

          Repair is a much, much more heavyweight solution to the problem than just "stream from the node that is 'displaced.'"

          hanzhu Zhu Han added a comment -

          Sometimes the node is replaced because the hardware has crashed. If so, "streaming from the node being replaced" is not available.

          How about forcing a repair when the user specifies that they need quorum consistency and the original node is gone?

          thepaul paul cannon added a comment -

          Ok, so if we always prefer to bootstrap from the correct token, then I still think we should combine getRangesWithStrictSource and getRangesWithSources. Basically the logic should be, find the 'best' node to stream from. If the user requested it, also find a list of other candidates and order them by proximity. Right?

          I don't think so. I would still want to leave the option to stream from the closest even if the strict best node is available.

          thepaul paul cannon added a comment -

          I'm not sure I understand, are you saying that B would violate this, or just that the status quo does?

          I'm saying B would violate this, yes. B was "bootstrap from the right token, but if that one isn't up, bootstrap from any other token preferring the closer ones", right? I'm saying we can't just automatically choose another token if the user didn't specifically say it's ok.

          nickmbailey Nick Bailey added a comment -

          Paul,

          The suggestion was that if the 'correct' node is down, you can force the bootstrap to complete anyway (probably from the closest node, but that is transparent to the user), but only if the 'correct' node is down. It sounds like you agree with Jonathan on the more general approach though.

          Zhu,

          Repair doesn't help in the case when you lost data due to a node going down. Also if only one node is down you should still be able to read/write at quorum and achieve consistency (assuming your replication factor is greater than 2).

          jbellis Jonathan Ellis added a comment -

          I'm saying we can't just automatically choose another token if the user didn't specifically say it's ok.

          Oh, ok. Right. (I thought we were just bikeshedding over whether to call the "manual override" option "use closest" or "force bootstrap.")

          Repair doesn't help in the case when you lost data due to a node going down

          Additionally, I don't like the idea of automatically doing expensive things like repair; it feels cleaner to not do it automatically, and allow using the existing tool to perform one if desired, than to do it by default and have to add an option to skip it for when that's not desirable.

          hanzhu Zhu Han added a comment -

          Also if only one node is down you should still be able to read/write at quorum and achieve consistency

          I suppose quorum read plus quorum write should provide monotonic read consistency. [1] Suppose a quorum write on key1 hits node A and node B, but not node C, due to a temporary network partition. After that, node B is replaced by node D since it is down, and node D streams data from node C. If a following quorum read on key1 hits only node C and node D, monotonic consistency is violated. This is rare but not unrealistic, especially when hinted handoff is disabled.

          Maybe it is more reasonable to give the admin an option to specify that the bootstrapped node should not accept any read request until the admin turns it on manually. Then the admin can start a manual repair if he wants to ensure everything goes fine.

          [1]http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

          thepaul paul cannon added a comment -

          The suggestion was that if the 'correct' node is down, you can force the bootstrap to complete anyway (probably from the closest node, but that is transparent to the user), but only if the 'correct' node is down.

          Oh, ok. I misunderstood. This seems reasonable. I'd lean for the more general solution, yeah, but I don't feel very strongly about it.

          hanzhu Zhu Han added a comment -

          As Peter suggested before, another approach to fix the consistency problem is streaming sstables from all live peers if the "correct" node is down, and then leaving them to normal compaction.

          This would be much more lightweight than anti-entropy repair, except for the network I/O pressure on the bootstrapping node.

          thepaul paul cannon added a comment -

          Updated patch fixes the docstring for getRangesWithStrictSource().

          thepaul paul cannon added a comment -

          Patch 2434-testery.patch.txt adds a bit to unit tests to exercise o.a.c.dht.BootStrapper.getRangesWithStrictSource().

          thepaul paul cannon added a comment -

          2434-3.patch.txt removes the bits that add the "-n" option to nodetool join. Apparently no "nodetool join" should ever result in a bootstrap, so it doesn't matter whether the caller wants "strict" or not.

          jbellis Jonathan Ellis added a comment -

          Do we need to do anything special for move/decommission as well?

          nickmbailey Nick Bailey added a comment -

          I had a note to remember to create a ticket for that, but if we want to do it here that works as well.

          In any case, yes the same concerns exist when giving away ranges as when gaining ranges.

          thepaul paul cannon added a comment -

          Do we need to do anything special for move/decommission as well?

          Yes, it looks like we do need to add similar logic for move. Expand the scope of this ticket accordingly?

          I don't see any way decommission could be affected by this sort of problem.

          thepaul paul cannon added a comment -

          In any case, yes the same concerns exist when giving away ranges as when gaining ranges.

          Oh? I must be missing something. What would a consistency violation failure scenario look like for giving away ranges?

          jbellis Jonathan Ellis added a comment -

          Expand the scope of this ticket accordingly?

          Yes, let's solve them both here.

          nickmbailey Nick Bailey added a comment -

          My comment wasn't very clear. Both decom and move currently attempt to do the right thing. When a node is leaving, there should be one new replica for each of the ranges it is responsible for. If it can't stream data to that replica, there is a consistency problem.

          Both operations currently try to stream to that replica, but we should use the 'strict' logic in those cases as well and fail if we can't guarantee consistency and the user hasn't disabled strict.

          thepaul paul cannon added a comment -

          If decom can't stream data to the appropriate replica, then it should just fail, right? Do we support decom in cases where a consistency violation would result? Seems like it has to be the user's responsibility to bring up or decom the other node first.

          move could introduce a violation when it gains a new range, though, in the same cases as the bootstrap issue explained above.

          nickmbailey Nick Bailey added a comment -

          If we support it for bootstrapping I don't see why we shouldn't support it for decom. Right, move has the problem in both cases (giving away ranges, gaining ranges).

          thepaul paul cannon added a comment -

          I think we're talking about different things. Requiring the user to have the right nodes available for operation X is not the same as "cassandra can 'lose' writes when it happens to stream from the wrong node, even if the user did everything right".

          This ticket is about the latter, I think.

          thepaul paul cannon added a comment -

          Conversation on #cassandra-dev resulted in the conclusion that we'll fix this bug for range acquisition (bootstrap and move) now, and plan to allow the same looseness (non-strict mode, or whatever) for range egress (move and decom) in the future.

          I think.

          jbellis Jonathan Ellis added a comment -

          It's always been unsupported to bootstrap a second node into the same "token arc" while a previous one is ongoing.

          I'm pretty sure now that this is incorrect; we fixed it back in CASSANDRA-603. I'm updating the comments in TokenMetadata as follows:

              // Prior to CASSANDRA-603, we just had <tt>Map<Range, InetAddress> pendingRanges</tt>,
              // which was added to when a node began bootstrap and removed from when it finished.
              //
              // This is inadequate when multiple changes are allowed simultaneously.  For example,
              // suppose that there is a ring of nodes A, C and E, with replication factor 3.
              // Node D bootstraps between C and E, so its pending ranges will be E-A, A-C and C-D.
              // Now suppose node B bootstraps between A and C at the same time. Its pending ranges
              // would be C-E, E-A and A-B. Now both nodes need to be assigned pending range E-A,
              // which we would be unable to represent with the old Map.  The same thing happens
              // even more obviously for any nodes that boot simultaneously between same two nodes.
              //
              // So, we made two changes:
              //
              // First, we changed pendingRanges to a <tt>Multimap<Range, InetAddress></tt> (now
              // <tt>Map<String, Multimap<Range, InetAddress>></tt>, because replication strategy
              // and options are per-KeySpace).
              //
              // Second, we added the bootstrapTokens and leavingEndpoints collections, so we can
              // rebuild pendingRanges from the complete information of what is going on, when
              // additional changes are made mid-operation.
              //
              // Finally, note that recording the tokens of joining nodes in bootstrapTokens also
              // means we can detect and reject the addition of multiple nodes at the same token
              // before one becomes part of the ring.
              private BiMap<Token, InetAddress> bootstrapTokens = HashBiMap.create();
              // (don't need to record Token here since it's still part of tokenToEndpointMap until it's done leaving)
              private Set<InetAddress> leavingEndpoints = new HashSet<InetAddress>();
              // this is a cache of the calculation from {tokenToEndpointMap, bootstrapTokens, leavingEndpoints}
              private ConcurrentMap<String, Multimap<Range, InetAddress>> pendingRanges = new ConcurrentHashMap<String, Multimap<Range, InetAddress>>();
          
          thepaul paul cannon added a comment -

          I'm pretty sure now that this is incorrect;

          Well, doh. That puts us back at my first question:

          So, it looks like it will be possible for the node-that-will-be-removed to change between starting a bootstrap and finishing it (other nodes being bootstrapped/moved/decom'd during that time period); in some cases, that could still lead to a consistency violation. Is that unlikely enough that we don't care, here? At least the situation would be better with the proposed fix than it is now.

          jbellis Jonathan Ellis added a comment -

          Can you give an example for illustration?

          nickmbailey Nick Bailey added a comment -

          Ok, so I think there are really two consistency issues here. Firstly, picking the 'right' node to stream data to/from when making changes to the ring. Secondly, disallowing concurrent changes that have overlapping ranges.

          Currently we only disallow nodes from moving/decommissioning when they may potentially have data being streamed to them. There are a few examples of things we currently allow which I think are generally a bad idea.

          1) Say you have nodes A and D, if you bootstrap nodes B and C at the same time in between A and D, it may turn out that the correct node to stream from for both nodes is D. Now say node C finishes bootstrapping before node B. At that point, the correct node for B to bootstrap from is technically C, although D still has the data. However, since D is no longer technically responsible for the data, the user could run cleanup on D and delete the data that B is attempting to stream.

          2) The above case is also a problem when you bootstrap a node and the node it decides it needs to stream from is moving. Once that node finishes moving you could run cleanup on that node and delete data that the bootstrapping node needs. In this case, all documentation indicates you should do a cleanup after a move in order to remove old data, so it seems possibly more likely.

          3) A variation of the above case is when you bootstrap a node and the node it streams from is leaving. In that case the decom may finish and the user could terminate the cassandra process and/or node, breaking any streams. Not to mention the idea of a node in a decommissioned state continuing to stream seems like a bad idea. I believe it would work currently, but I'm not sure and it seems likely to break.

          I can't really think of any other examples but I think thats enough to illustrate that overlapping concurrent ring changes are a bad idea and we should just attempt to prevent them in all cases. An argument could be made that this would prevent you from doubling your cluster (the best way to grow) all at once, but I don't think that's really a huge deal. At most you would need RF steps to double your cluster.

          Show
          nickmbailey Nick Bailey added a comment - Ok, so I think there are really two consistency issues here. Firstly, picking the 'right' node to stream data to/from when making changes to the ring. Secondly, disallowing concurrent changes that have overlapping ranges. Currently we only disallow nodes from moving/decommissioning when they may potentially have data being streamed to them. There are a few examples of things we currently allow which I think are generally a bad idea. 1) Say you have nodes A and D, if you bootstrap nodes B and C at the same time in between A and D, it may turn out that the correct node to stream from for both nodes is D. Now say node C finishes bootstrapping before node B. At that point, the correct node for B to bootstrap from is technically C, although D still has the data. However, since D is no longer technically responsible for the data, the user could run cleanup on D and delete the data that B is attempting to stream. 2) The above case is also a problem when you bootstrap a node and the node it decides it needs to stream from is moving. Once that node finishes moving you could run cleanup on that node and delete data that the bootstrapping node needs. In this case, all documentation indicates you should do a cleanup after a move in order to remove old data, so it seems possibly more likely. 3) A variation of the above case is when you bootstrap a node and the node it streams from is leaving. In that case the decom may finish and the user could terminate the cassandra process and/or node breaking any streams. Not to mention the idea of a node in a decommissioned state continuing to stream seems like a bad idea. I believe it would work currently, but I'm not sure and it seems likely to break. I can't really think of any other examples but I think thats enough to illustrate that overlapping concurrent ring changes are a bad idea and we should just attempt to prevent them in all cases. An argument could be made that this would prevent you from doubling your cluster (the best way to grow) all at once, but I don't think that's really a huge deal. At most you would need RF steps to double your cluster.
          Hide
          thepaul paul cannon added a comment -

          I think we can still allow overlapping concurrent ring changes, with the right set of invariants and/or operational rules. I've been trying to work on defining those. It's pretty tricky but I think I have the right way of approaching the model now. Will update later today with more.

          Nick is right though, c* probably shouldn't support overlapping changes as is. Think bootstrapping >N nodes between the same two old nodes, decom'ing one node and bootstrapping another in such a way that the bootstrap source stream node becomes the wrong source, etc.

          nickmbailey Nick Bailey added a comment -

          So the fact that the 'correct' bootstrap source switches mid stream isn't really the problem I don't think. We set up pending ranges so that when a node gets ready to bootstrap, writes start getting duplicated to both the currently correct replica, and the bootstrapping node. Since all new writes are duplicated we can stream from that node and as long as we get the entire dataset, consistency should be fine. The problem is there is nothing in place preventing someone from running cleanup or killing a decommed node or something.

          I'm doubtful that the complexity of a correct set of rules for allowing overlapping ring changes is really worth the time/effort/fragility. It doesn't seem like that much of a loss to me to disallow them. Perhaps your set of rules will be super simple though.

          thepaul paul cannon added a comment -

          Right, I know how the pending-ranges writes work. But there are still possible openings for consistency violations for previously-written data, similar to the one outlined in the original ticket description. If the bootstrap stream source has an outdated value and it gets duplicated to a bootstrapping node, but then the bootstrap stream source doesn't leave the replication set (because something else changed in the interim), the outdated value is now on more nodes than it used to be - possibly now a quorum, when previously it was safely below that.

          thepaul paul cannon added a comment -

          Ok, prospective approach to totally safe range movements:

          Operational rules:

          • Cassandra will not allow two range motion operations (move, bootstrap, decom) at the same time on the same node.
          • When a range motion operation is already pending, User should refrain from starting another range motion operation (if either motion operation overlaps the arc-of-effect of the other) until the gossip info about the first change has propagated to all affected nodes. (This is more simply approximated by the "two minute rule".)
          • Every point in the tokenspace has the same number of natural endpoints, and they're ordered the same from the perspective of all nodes (is this an ok assumption?).
          • It is User's responsibility to make sure that the right streaming source nodes are available. If they're not, the range motion operation may fail.

          Procedure:

          • For any motion involving range R, there will be a stream from endpoint EP_source to endpoint EP_dest. Given the same information about what range motion operations are pending (TokenMetadata) and the range R, there is a bijection from EP_source to EP_dest, shared by all nodes in the ring.
          • Procedure to determine EP_source from EP_dest (a minimal sketch in Java follows this list):
            • Let REP_current be the existing (ordered) list of natural endpoints for R.
            • Let TM_future be a clone of the current TokenMetadata, but with all ongoing bootstraps, moves, and decoms resolved and completed.
            • Let REP_future be the list of (ordered) natural endpoints for R according to TM_future.
            • Let EPL_entering be the list of endpoints in REP_future which are not in REP_current (preserving their order in REP_future).
            • Let EPL_leaving be the list of endpoints in REP_current which are not in REP_future (preserving their order in REP_current).
            • EPL_entering and EPL_leaving are of the same length.
            • Let Pos be the position/index of EP_dest in EPL_entering.
            • Let EP_source be the endpoint at position Pos in EPL_leaving.
          • Intuitively, this is the same as the rule expressed earlier in this ticket (stream from the node you'll replace), but also handles other ongoing range movements in the same token arc.
          • These rules can be pretty trivially inverted to determine EP_dest from EP_source.
          • When any node gets gossip about a range motion occurring within its token arc-of-effect, it calculates (or recalculates) the streams in which it should be involved. Any ongoing streams which are no longer necessary are canceled, and any newly necessary streams are instigated.
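
          A minimal Java sketch of the EP_source/EP_dest pairing above (hypothetical helper names; both lists preserve replica order, so the index into EPL_entering maps directly to the same index in EPL_leaving):

              // Sketch only: resolveAllPendingMotions/naturalEndpoints are hypothetical helpers.
              InetAddress sourceFor(Range r, InetAddress epDest, TokenMetadata current)
              {
                  TokenMetadata future = resolveAllPendingMotions(current);      // TM_future
                  List<InetAddress> repCurrent = naturalEndpoints(r, current);   // REP_current (ordered)
                  List<InetAddress> repFuture  = naturalEndpoints(r, future);    // REP_future (ordered)

                  List<InetAddress> entering = new ArrayList<InetAddress>(repFuture);
                  entering.removeAll(repCurrent);   // EPL_entering, in REP_future order
                  List<InetAddress> leaving = new ArrayList<InetAddress>(repCurrent);
                  leaving.removeAll(repFuture);     // EPL_leaving, in REP_current order

                  assert entering.size() == leaving.size();
                  return leaving.get(entering.indexOf(epDest));                  // EP_source
              }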

          I tried to construct a ruleset without that last rearrange-ongoing-streams rule, but it ended up with a pretty complicated set of extra restrictions, and a more complicated set of procedures than this.

          This set of rules might look complicated, but I think it should be fairly straightforward to implement, and may even end up simpler overall than our current code.

          Note that this procedure even maintains the consistency guarantee in cases like:

          • In an RF=3 cluster with nodes A, E, and F, bootstrap B, C, and D in quick succession (E streams to B, F streams to C, A streams to D)
          • In an RF=3 cluster with nodes A, C, and E, bootstrap B, D, and F, and decommission A, C, and E, all in quick succession (A streams to B, C streams to D, E streams to F)
          • In an RF=3 cluster with nodes A, B, C, D, and E, decommission B and C in quick succession (B streams to D, C streams to E)
          thepaul paul cannon added a comment -

          Clarification: the cleanup operation would need to consider pending ranges in addition to the current natural range-endpoint mapping.

          It would be possible with this proposal for a node M to be streaming range S to a bootstrapping node Y, and midway, for M to stop being part of the replication set for S (perhaps some other bootstraps nearby completed first). This should be ok for consistency, but a cleanup operation on M while the stream is ongoing could potentially remove all the data in S unless this change is made.

          Further clarification: instead of the W+1 special case we now have for writing to a range T with a pending motion, we would need to write to W+C replicas, where C is the number of pending motions within the replication set of T.
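
          (A toy illustration of the W+C rule, hypothetical names only:)

          def required_write_acks(w, pending_motions):
              # Block for W acks plus one extra ack per pending range motion
              # within the range's replication set.
              return w + pending_motions

          # QUORUM on RF=3 normally blocks for 2 acks; with two pending motions
          # touching the range it must block for 4.
          assert required_write_acks(2, 0) == 2
          assert required_write_acks(2, 2) == 4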

          jbellis Jonathan Ellis added a comment -

          Coming back to this after the 1.0 scramble.

          It sounds like you have a reasonable solution here, is there any reason not to implement it for 1.1?

          thepaul paul cannon added a comment -

          It sounds like you have a reasonable solution here, is there any reason not to implement it for 1.1?

          Just that it's quite a bit more complex than simply disallowing overlapping ring movements, with all the extra problems that higher complexity brings. I think this feature is worth it on its own, but when I think of how much pain Brandon seems to be going through dealing with the streaming code, maybe it's not.

          jbellis Jonathan Ellis added a comment -

          No longer very optimistic on the "may even end up simpler overall than our current code" front?

          TBH this area of the code is fragile and hairy and maybe starting from a clean slate with a real plan instead of trying to patch things in haphazardly would be a good thing.

          But, I'd be okay with re-imposing the "no overlapping moves" rule and fixing the stream source problem if that's going to be substantially simpler.

          thepaul paul cannon added a comment -

          No, I do think that if we tore out the existing code and replaced it, it would be simpler overall, but (a) that would probably also be true if we rewrote the existing code without implementing this; (b) it will be rather a lot of work; and (c) it may engender a whole new generation of subtle corner-case bugs (or maybe it will eliminate a lot of such bugs that already exist).

          jbellis Jonathan Ellis added a comment -

          How much work would it be to add "just one more bandaid" for the stream source thing, in comparison?

          thepaul paul cannon added a comment -

          Do you mean to implement the new "safe range movements" procedure outlined above, without rewriting the rest of the range movement code?

          If so, I submit a SWAG of "1/3-1/2 the cost of the rewrite option". The bandaid would still touch a lot of different moving parts.

          jbellis Jonathan Ellis added a comment -

          So either way it's a substantial amount of work, but the bandaid still leaves us with rules that the operator must enforce, or else hit subtle problems.

          My hang-up with the bandaid is the part where C* can't effectively enforce the guidelines to be safe. Even "wait X seconds between bootstrap operations" is not a prereq I am comfortable with.

          Unless the above is incorrect, I think we should bite the bullet and fix it "right."

          thepaul paul cannon added a comment -

          So, I believe that the rules outlined above can still work without the "wait X seconds between bootstrap operations" prereq, if a pretty simple extra step is added:

          If any node learns about conflicting move operations, then some rules are applied to choose which will be honored and which will return an error to its caller (if still possible).

          Those rules are:

          • A decom for node X beats a move or bootstrap for node X
          • Two decoms for node X from coordinator nodes Y and Z: the coordinator with the higher token wins
          • Any other conflicts between move/bootstrap operations for the same node (which can arise in certain partition situations) are easily resolved by latest VersionedValue.

          This should guarantee convergence of TokenMetadata across any affected parts of a cluster.
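
          (A rough sketch of that tie-breaking, with hypothetical records standing in for the gossip state; not the actual implementation:)

          from collections import namedtuple

          # 'op' is decom/move/bootstrap; 'version' stands in for the gossip
          # VersionedValue generation; 'coordinator_token' is the token of the
          # node that initiated the operation.
          Motion = namedtuple("Motion", "op node coordinator_token version")

          def resolve(a, b):
              # Pick which of two conflicting motions for the same node is
              # honored; the loser is failed back to its caller if possible.
              assert a.node == b.node
              # A decom for node X beats a move or bootstrap for node X.
              if a.op == "decom" and b.op != "decom":
                  return a
              if b.op == "decom" and a.op != "decom":
                  return b
              # Two decoms: the coordinator with the higher token wins.
              if a.op == b.op == "decom":
                  return a if a.coordinator_token > b.coordinator_token else b
              # Any other conflict: latest VersionedValue wins.
              return a if a.version >= b.version else b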

          jbellis Jonathan Ellis added a comment -

          Sounds reasonable.

          jbellis Jonathan Ellis added a comment -

          As a half measure, we can stream from the "right" node very easily if we continue to make the simplifying assumption that no other node movement happens in overlapping ranges during the operation.
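
          (In that simplified case the source choice reduces to "the one replica that drops out of the natural endpoint list"; a hypothetical sketch:)

          def strict_stream_source(current_replicas, future_replicas):
              # With at most one movement in flight for this range, exactly one
              # current replica is displaced; that node is the mandatory source.
              displaced = [ep for ep in current_replicas if ep not in future_replicas]
              assert len(displaced) == 1, "overlapping movements not handled here"
              return displaced[0]

          # Bootstrapping N into a range currently held by (X, Y, Z), where the
          # future replica set is (N, X, Y): Z is displaced, so N streams from Z.
          assert strict_stream_source(["X", "Y", "Z"], ["N", "X", "Y"]) == "Z"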

          tjake T Jake Luciani added a comment -

          I've taken a crack at this, initially for 1.2 since it solves my pain. Appreciate a review.

          As Jonathan Ellis mentions above, it requires that only one node be added at a time. Also, the bootstrapping node must add -Dconsistent.bootstrap=true.

          #code
          https://github.com/tjake/cassandra/tree/2434

          #dtest showing it works (use ENABLE_VNODES=yes)
          https://github.com/tjake/cassandra-dtest/tree/2434

          jbellis Jonathan Ellis added a comment -

          (Tyler Hobbs to review)

          thobbs Tyler Hobbs added a comment -

          Thanks, Jake.

          I strongly prefer to default to the strict/safe behavior and make the user supply a "force" option for non-strict behavior, like Nick and Paul agreed on above. If the bootstrapping node cannot stream from the correct replica and the "force" option isn't set, it should abort the bootstrap with an error that describes the implications and mentions how to use the "force" option.
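
          (For illustration, that strict-by-default gate might look roughly like this; the names and the force flag are hypothetical, not the code in the linked branch:)

          def pick_bootstrap_source(preferred, other_replicas, is_alive, force=False):
              # Strict by default: refuse to continue if the one correct source
              # is unavailable, unless the operator explicitly forces it.
              if is_alive(preferred):
                  return preferred
              if not force:
                  raise RuntimeError(
                      "cannot stream from %s; bootstrapping from another replica "
                      "may violate consistency. Re-run with the force option to "
                      "accept that risk." % preferred)
              live = [ep for ep in other_replicas if is_alive(ep)]
              if not live:
                  raise RuntimeError("no live replicas to bootstrap from")
              return live[0]

          # e.g. pick_bootstrap_source("A", ["B", "C"], lambda ep: ep != "A", force=True) returns "B"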

          Additionally, I think your logic for picking the preferred replica could be greatly simplified. Paul's 2434-3.patch.txt has a really simple version of this and also has the strict-by-default behavior. It might be worthwhile to look at rebasing that patch as a start.

          Paul mentioned this:

          Conversation on #cassandra-dev resulted in the conclusion that we'll fix this bug for range acquisition (bootstrap and move) now, and plan to allow the same looseness (non-strict mode, or whatever) for range egress (move and decom) in the future.

          Looking at the irc logs, there wasn't a strong reason for this. There's a lot of code overlap there, so it would be ideal to fix both types of operations at once. Do you think you could take a stab at that?

          tjake T Jake Luciani added a comment -

          Additionally, I think your logic for picking the preferred replica could be greatly simplified. Paul's 2434-3.patch.txt has a really simple version of this and also has the strict-by-default behavior. It might be worthwhile to look at rebasing that patch as a start.

          I did look at the patch and I'll see how I can simplify my version. Most of the complexity comes from multiple ranges living on the same address (vnodes), which the old version didn't have to worry about.
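
          (The vnodes wrinkle is essentially "many ranges, one source"; a toy sketch of collapsing per-range choices into per-endpoint stream sessions, hypothetical structure:)

          from collections import defaultdict

          def group_streams(range_to_source):
              # With vnodes, one peer is typically the strict source for many of
              # the ranges being fetched; group them so each peer gets a single
              # stream session covering all of its ranges.
              by_source = defaultdict(list)
              for rng, source in range_to_source.items():
                  by_source[source].append(rng)
              return dict(by_source)

          print(group_streams({(0, 100): "A", (100, 200): "B", (200, 300): "A"}))
          # {'A': [(0, 100), (200, 300)], 'B': [(100, 200)]}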

          I do think we can fix the other operations, but those are less of a priority IMO and should be part of a follow-up. (Does anyone use move with vnodes?)

          brandon.williams Brandon Williams added a comment -

          does anyone use move with vnodes?

          No, but now they can use relocate (taketoken in nodetool)

          tjake T Jake Luciani added a comment -

          Pushed an update to https://github.com/tjake/cassandra/tree/2434 that addresses the comments. I'm going to work on support for move/relocate.

          tjake T Jake Luciani added a comment -

          Do we care about decommissions? It seems that when we "push" data to other nodes there isn't anything to do; this ticket only applies when we pick the replica to stream from.

          thobbs Tyler Hobbs added a comment -

          Do we care about decommissions? It seems that when we "push" data to other nodes there isn't anything to do; this ticket only applies when we pick the replica to stream from.

          I checked that StorageService.getChangedRangesForLeaving() pushes to the correct nodes (those that are gaining a range), so you're right, we don't need to do anything new for decom.
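
          (For completeness, the "push" side on decommission reduces to the inverse set difference; a sketch with hypothetical names, not the actual getChangedRangesForLeaving():)

          def push_targets(current_replicas, future_replicas):
              # A leaving node pushes each of its ranges to whichever endpoints
              # appear in the future replica set but not the current one.
              return [ep for ep in future_replicas if ep not in current_replicas]

          # An RF=3 range held by (B, C, D); after decommissioning B it is held
          # by (C, D, E), so the range is pushed to E.
          assert push_targets(["B", "C", "D"], ["C", "D", "E"]) == ["E"]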

          tjake T Jake Luciani added a comment -

          Updated branches with move/relocate support and added a dtest for move (in dtest branch linked above)

          brandon.williams Brandon Williams added a comment -

          We should probably add a test for relocate too since it's fundamentally different from move.

          thobbs Tyler Hobbs added a comment -

          I tested bootstrapping a node while the preferred replica was down. It turns out that CASSANDRA-6385 makes the bootstrapping node consider the replica up for long enough to pass the checks. It looks like we need to special-case the 6385 behavior for bootstraps if we want this patch to work.

          brandon.williams Brandon Williams added a comment -

          You can test this now by setting cassandra.fd_initial_value_ms.

          thobbs Tyler Hobbs added a comment - edited

          Okay, with the workaround on the FD, bootstrap seems to work. Do we want to split that fix into a separate ticket?

          However, relocate seems to be seriously broken. With a three node cluster and one of the nodes down, I can make relocate fail in a couple of ways:

          • oldEndpoints == newEndpoints, so the assertion that the difference between them has length 1 fails
          • There are no ranges that contain the "desiredRange", resulting in the IllegalStateException being thrown ("No sources found for " + toFetch);

          With that said, nothing (including the tools) uses relocate. (EDIT: shuffle uses it, but nobody uses shuffle in practice due to other problems.) The JMX version doesn't work with jconsole, so I had to add a method to test this. I'm not even sure that relocate worked before this patch for vnodes, because there's only minimal test coverage for relocate. IMO, we shouldn't even try to modify this without good test coverage. But if nothing even uses relocate... I'm not sure what to do. Thoughts?

          tjake T Jake Luciani added a comment -

          I will try working on a relocate test and look at addressing this

          tjake T Jake Luciani added a comment -

          What version should this go into? I personally need this for 1.2 but I would put it in with the default of non-strict.

          thobbs Tyler Hobbs added a comment -

          T Jake Luciani I think we want 2.1 with a default of strict.

          tjake T Jake Luciani added a comment -

          Rebased to 2.1 branch and pushed to https://github.com/tjake/cassandra/tree/2434-2

          Also added a relocation dtest https://github.com/tjake/cassandra-dtest/tree/2434

          thobbs Tyler Hobbs added a comment -

          +1 overall, with a few minor nitpicks on the dtest:

          You can replace that string-building loop with:

          tl = " ".join(str(t) for t in tokens[0][:8])
          

          There's also a leftover line: #assert 1 == 0

          Don't forget to mark the new tests with a @since('2.1') decorator.

          tjake T Jake Luciani added a comment -

          Committed the C* code; I'll push to dtests now.

          kohlisankalp sankalp kohli added a comment -

          This will be very nice to have in 2.0
          cc Brandon Williams

          brandon.williams Brandon Williams added a comment -

          Does seem like mostly new code we could just add with the flag defaulting to off to give 2.0 the option.


            People

            • Assignee: tjake T Jake Luciani
            • Reporter: scode Peter Schuller
            • Reviewer: Tyler Hobbs
            • Votes: 0
            • Watchers: 19
