Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.1 beta2
- Component/s: None
- Labels: None
Description
My reading (a while ago) of the code indicates that there is no logic involved during bootstrapping that avoids consistency level violations. If I recall correctly it just grabs neighbors that are currently up.
There are at least two issues I have with this behavior:
- If I have a cluster where applications rely on QUORUM with RF=3, and bootstrapping completes based on only one node, I have just violated the supposedly guaranteed consistency semantics of the cluster.
- Nodes can flap up and down at any time, so even if a human takes care to look at which nodes are up and thinks about it carefully before bootstrapping, there's no guarantee.
A complication is that whether this is an issue depends on the use case (if everything you do is at CL.ONE, it's fine); but even in a cluster which is otherwise used for QUORUM operations you may wish to accept fewer-than-quorum source nodes during bootstrap in various emergency situations.
A potential easy fix is to have bootstrap take an argument which is the number of hosts to bootstrap from, or to assume QUORUM if none is given.
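To make the quorum arithmetic concrete, here is a tiny hypothetical sketch (mine, not from any patch on this ticket) of how such an argument could be turned into a required-source count; with RF=3 and no explicit count, it would demand 3/2 + 1 = 2 live sources:
// Hypothetical sketch only; the class and parameter names are illustrative.
final class BootstrapSourceCount
{
    // If the operator supplies an explicit source count, honor it;
    // otherwise default to a quorum of the replication factor.
    static int requiredSources(Integer operatorSuppliedCount, int replicationFactor)
    {
        if (operatorSuppliedCount != null)
            return operatorSuppliedCount;
        return replicationFactor / 2 + 1; // QUORUM for this RF
    }
}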
(A related concern is bootstrapping across data centers. You may want to bootstrap from a local node and then do a repair to avoid sending loads of data across DCs while still achieving consistency. Or even if you don't care about the consistency issues, I don't think there is currently a way to bootstrap from local nodes only.)
Thoughts?
Attachments
- 2434-3.patch.txt (14 kB, paul cannon)
- 2434-testery.patch.txt (2 kB, paul cannon)
Issue Links
- blocks CASSANDRA-3516: Make bootstrapping smarter instead of using 120 seconds to stabilize the ring (Resolved)
Activity
Related: CASSANDRA-833
So, it looks like it will be possible for the node-that-will-be-removed to change between starting a bootstrap and finishing it (other nodes being bootstrapped/moved/decom'd during that time period); in some cases, that could still lead to a consistency violation. Is that unlikely enough that we don't care, here? At least the situation would be better with the proposed fix than it is now.
Second question: what might the "permission from the operator to choose a replica that is closer/less dead" look like? Maybe just a boolean flag saying "it's ok to stream from any node for any range you need to stream"? Or would we want to allow specifying precise source nodes for any/all affected address ranges?
it looks like it will be possible for the node-that-will-be-removed to change between starting a bootstrap and finishing it
It's always been unsupported to bootstrap a second node into the same "token arc" while a previous one is ongoing. Does that cover what you're thinking of or are we still on the hook?
what might the "permission from the operator to choose a replica that is closer/less dead" look like?
It seems to me that the two valid choices are
- Stream from "correct" replica
- Stream from closest replica
I can't think of a reason to stream from an arbitrary replica other than those options.
It's always been unsupported to bootstrap a second node into the same "token arc" while a previous one is ongoing. Does that cover what you're thinking of or are we still on the hook?
Is it also unsupported to decom within X's token arc, or move into/out of that arc, while X is bootstrapping? I think we're safe if so.
Yes.
This still needs some testing, but I'm putting it up now in case anyone has some time to take a look and make sure my approach is sane.
Adds an optional "-n" argument to "nodetool join" to allow bootstrapping from closest live node (n == non-strict). Also recognizes an optional property "cassandra.join_strict" which can be set to false when a bootstrap is triggered by cassandra.join_ring.
Seems to me like the option to stream from the closest replica might just add more confusion without really gaining anything. The node that is leaving the replica set will never be in another datacenter. It could be on a different rack, but if you are following best practices and alternating racks, then it is likely either on the same rack, or there is only one copy on that rack and the best case possible is streaming from another rack anyway.
Judging by the irc channel and user list, assuming people will follow best practices seems a bit of a dead end. Plus, what about the case where the node leaving the replica set is dead? You still want the option to allow choosing another to stream from. And we probably shouldn't default to choosing another without explicit permission, because of the consistency violation stuff.
The node that is leaving the replica set will never be in another datacenter
I think that's only strictly true for NTS, but I'm fine leaving it out. Not worth adding complexity for ONTS at this point.
what about the case where the node leaving the replica set is dead
Good point. We do need something to make that possible.
Yeah I was assuming ONTS is basically deprecated at this point. Didn't think about the dead case though. I suppose just a 'force' type of option and a warning indicating the possible consistency issues works.
As long as we need to handle the dead case I don't see any harm in having a slightly more generally-useful "use closest" option instead of "force to pick random live replica" option.
Well, I imagine the 'force' option would pick the nearest live node. By 'force' I mean the option should be posed to the user as "We can't guarantee consistency in your cluster after this bootstrap since a node is down, if you would like to do this anyway, specify option X". Just saying you can either bootstrap or bootstrap from the closest node doesn't convey the implications as well I don't think.
Maybe we are on the same page and arguing over wording though.
Just as initial feedback, I'm not sure we need a new getRangesWithSources method, especially with so much duplication between them. Seems like strict could be passed to the current method. Also, what about leaving getRangesWithSources how it is and passing strict to getWorkMap? That method can do the endpoint set math if it needs to and throw a more informative exception in the case that strict is set and the endpoint we want to fetch from is dead.
I did that (passing strict to getWorkMap) at first, but it wasn't too clean since it required adding a 'table' argument as well as 'strict', and it ended up replacing too much of the getRangesWithSources functionality. So then I added 'strict' as a parameter to getRangesWithSources (actually, it looks like I neglected to update its javadoc comment), but the differences between getRangesWithSources and getRangesWithStrictSource are such that a combined method feels a lot more special-casey and clunky. I like this way best in the end.
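For clarity, the "strict" selection being discussed boils down to a set difference: the only consistency-safe source for a range is the one replica that will be displaced once the new node's token takes effect. A simplified, hypothetical sketch of that idea (not the code in the attached patch):
import java.net.InetAddress;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the patch itself: compare the replicas for a range
// before and after the bootstrapping token is added; exactly one endpoint
// should drop out, and that displaced replica is the only strict stream source.
final class StrictSourceSketch
{
    static InetAddress strictSource(Set<InetAddress> currentReplicas, Set<InetAddress> futureReplicas)
    {
        Set<InetAddress> displaced = new HashSet<>(currentReplicas);
        displaced.removeAll(futureReplicas);
        if (displaced.size() != 1)
            throw new IllegalStateException("expected exactly one displaced replica, got " + displaced);
        return displaced.iterator().next();
    }
}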
Well I guess it kind of depends on which approach we take as well. Is the option A) bootstrap from the right token or bootstrap from the closest token, or B) bootstrap from the right token, but if that one isn't up, bootstrap from any other token preferring the closer ones.
Like I said, I'd say B, but if you and Jonathan both disagree.
Yeah. B is probably easier on everyone, but I would say we simply can't do anything that might violate the consistency guarantee without explicit permission from the user.
Ok, so if we always prefer to bootstrap from the correct token, then I still think we should combine getRangesWithStrictSource and getRangesWithSources. Basically the logic should be, find the 'best' node to stream from. If the user requested it, also find a list of other candidates and order them by proximity. Right?
I'm okay with either A or B.
I would say we simply can't do anything that might violate the consistency guarantee without explicit permission from the user
I'm not sure I understand, are you saying that B would violate this, or just that the status quo does?
Is it possible to make the node not reply to any requests before bootstrap and anti-entropy repair are finished?
This could fix the consistency problem introduced by bootstrap.
Repair is a much, much more heavyweight solution to the problem than just "stream from the node that is 'displaced.'"
Sometimes the node is being replaced because its hardware crashed. If so, "streaming from the node being replaced" is not an option.
How about forcing the repair to happen if the user says they need quorum consistency while the original node is gone?
Ok, so if we always prefer to bootstrap from the correct token, then I still think we should combine getRangesWithStrictSource and getRangesWithSources. Basically the logic should be, find the 'best' node to stream from. If the user requested it, also find a list of other candidates and order them by proximity. Right?
I don't think so. I would still want to leave the option to stream from the closest even if the strict best node is available.
I'm not sure I understand, are you saying that B would violate this, or just that the status quo does?
I'm saying B would violate this, yes. B was "bootstrap from the right token, but if that one isn't up, bootstrap from any other token preferring the closer ones", right? I'm saying we can't just automatically choose another token if the user didn't specifically say it's ok.
Paul,
The suggestion was that if the 'correct' node is down, you can force the bootstrap to complete anyway (probably from the closest node, but that is transparent to the user), but only if the 'correct' node is down. It sounds like you agree with Jonathan on the more general approach though.
Zhu,
Repair doesn't help in the case when you lost data due to a node going down. Also if only one node is down you should still be able to read/write at quorum and achieve consistency (assuming your replication factor is greater than 2).
I'm saying we can't just automatically choose another token if the user didn't specifically say it's ok.
Oh, ok. Right. (I thought we were just bikeshedding over whether to call the "manual override" option "use closest" or "force bootstrap.")
Repair doesn't help in the case when you lost data due to a node going down
Additionally, I don't like the idea of automatically doing expensive things like repair; it feels cleaner to not do it automatically, and allow using the existing tool to perform one if desired, than to do it by default and have to add an option to skip it for when that's not desirable.
Also if only one node is down you should still be able to read/write at quorum and achieve consistency
I suppose quorum read plus quorum write should provide monotonic read consistency. [1] Suppose a quorum write on key1 hits nodes A and B, but not node C, due to a temporary network partition. After that, node B is replaced by node D since it is down, and node D streams data from node C. If the following quorum read on key1 hits only node C and node D, the monotonic consistency is violated. This is rare but not unrealistic, especially when hinted handoff is disabled.
Maybe it is more reasonable to give the admin an option to specify that the bootstrapped node should not accept any read requests until the admin turns them on manually. That way the admin can run a manual repair first to make sure everything is consistent.
[1]http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
The suggestion was that if the 'correct' node is down, you can force the bootstrap to complete anyway (probably from the closest node, but that is transparent to the user), but only if the 'correct' node is down.
Oh, ok. I misunderstood. This seems reasonable. I'd lean for the more general solution, yeah, but I don't feel very strongly about it.
As Peter suggested before, another approach to fixing the consistency problem is to stream sstables from all live peers if the "correct" node is down, and then leave them to normal compaction.
This would be much more lightweight than anti-entropy repair, apart from the network I/O pressure on the bootstrapping node.
updated patch fixes the docstring for getRangesWithStrictSource().
Patch 2434-testery.patch.txt adds a bit to unit tests to exercise o.a.c.dht.BootStrapper.getRangesWithStrictSource().
2434-3.patch.txt removes the bits that add the "-n" option to nodetool join. Apparently no "nodetool join" should ever result in a bootstrap, so it doesn't matter whether the caller wants "strict" or not.
Do we need to do anything special for move/decommission as well?
I had a note to remember to create a ticket for that, but if we want to do it here that works as well.
In any case, yes the same concerns exist when giving away ranges as when gaining ranges.
Do we need to do anything special for move/decommission as well?
Yes, it looks like we do need to add similar logic for move. Expand the scope of this ticket accordingly?
I don't see any way decommission could be affected by this sort of problem.
In any case, yes the same concerns exist when giving away ranges as when gaining ranges.
Oh? I must be missing something. What would a consistency violation failure scenario look like for giving away ranges?
Expand the scope of this ticket accordingly?
Yes, let's solve them both here.
My comment wasn't very clear. Both decom and move currently attempt to do the right thing. When a node is leaving, there should be one new replica for each of the ranges it is responsible for. If it can't stream data to that replica there is a consistency problem.
Both operations currently try to stream to that replica, but we should use the 'strict' logic in those cases as well and fail if we can't guarantee consistency and the user hasn't disabled strict.
If decom can't stream data to the appropriate replica, then it should just fail, right? Do we support decom in cases where a consistency violation would result? Seems like it has to be the user's responsibility to bring up or decom the other node first.
move could introduce a violation when it gains a new range, though, in the same cases as the bootstrap issue explained above.
If we support it for bootstrapping I don't see why we shouldn't support it for decom. Right, move has the problem in both cases (giving away ranges, gaining ranges).
I think we're talking about different things. Requiring the user to have the right nodes available for operation X is not the same as "cassandra can 'lose' writes when it happens to stream from the wrong node, even if the user did everything right".
This ticket is about the latter, I think.
Conversation on #cassandra-dev resulted in the conclusion that we'll fix this bug for range acquisition (bootstrap and move) now, and plan to allow the same looseness (non-strict mode, or whatever) for range egress (move and decom) in the future.
I think.
It's always been unsupported to bootstrap a second node into the same "token arc" while a previous one is ongoing.
I'm pretty sure now that this is incorrect; we fixed it back in CASSANDRA-603. I'm updating the comments in TokenMetadata as follows:
// Prior to CASSANDRA-603, we just had <tt>Map<Range, InetAddress> pendingRanges</tt>,
// which was added to when a node began bootstrap and removed from when it finished.
//
// This is inadequate when multiple changes are allowed simultaneously. For example,
// suppose that there is a ring of nodes A, C and E, with replication factor 3.
// Node D bootstraps between C and E, so its pending ranges will be E-A, A-C and C-D.
// Now suppose node B bootstraps between A and C at the same time. Its pending ranges
// would be C-E, E-A and A-B. Now both nodes need to be assigned pending range E-A,
// which we would be unable to represent with the old Map. The same thing happens
// even more obviously for any nodes that boot simultaneously between same two nodes.
//
// So, we made two changes:
//
// First, we changed pendingRanges to a <tt>Multimap<Range, InetAddress></tt> (now
// <tt>Map<String, Multimap<Range, InetAddress>></tt>, because replication strategy
// and options are per-KeySpace).
//
// Second, we added the bootstrapTokens and leavingEndpoints collections, so we can
// rebuild pendingRanges from the complete information of what is going on, when
// additional changes are made mid-operation.
//
// Finally, note that recording the tokens of joining nodes in bootstrapTokens also
// means we can detect and reject the addition of multiple nodes at the same token
// before one becomes part of the ring.
private BiMap<Token, InetAddress> bootstrapTokens = HashBiMap.create();
// (don't need to record Token here since it's still part of tokenToEndpointMap until it's done leaving)
private Set<InetAddress> leavingEndpoints = new HashSet<InetAddress>();
// this is a cache of the calculation from {tokenToEndpointMap, bootstrapTokens, leavingEndpoints}
private ConcurrentMap<String, Multimap<Range, InetAddress>> pendingRanges = new ConcurrentHashMap<String, Multimap<Range, InetAddress>>();
I'm pretty sure now that this is incorrect;
Well, doh. That puts us back at my first question:
So, it looks like it will be possible for the node-that-will-be-removed to change between starting a bootstrap and finishing it (other nodes being bootstrapped/moved/decom'd during that time period); in some cases, that could still lead to a consistency violation. Is that unlikely enough that we don't care, here? At least the situation would be better with the proposed fix than it is now.
Can you give an example for illustration?
Ok, so I think there are really two consistency issues here. Firstly, picking the 'right' node to stream data to/from when making changes to the ring. Secondly, disallowing concurrent changes that have overlapping ranges.
Currently we only disallow nodes from moving/decommissioning when they may potentially have data being streamed to them. There are a few examples of things we currently allow which I think are generally a bad idea.
1) Say you have nodes A and D. If you bootstrap nodes B and C at the same time in between A and D, it may turn out that the correct node to stream from for both nodes is D. Now say node C finishes bootstrapping before node B. At that point, the correct node for B to bootstrap from is technically C, although D still has the data. However, since D is no longer technically responsible for the data, the user could run cleanup on D and delete the data that B is attempting to stream.
2) The above case is also a problem when you bootstrap a node and the node it decides it needs to stream from is moving. Once that node finishes moving you could run cleanup on that node and delete data that the bootstrapping node needs. In this case, all documentation indicates you should do a cleanup after a move in order to remove old data, so it seems possibly more likely.
3) A variation of the above case is when you bootstrap a node and the node it streams from is leaving. In that case the decom may finish and the user could terminate the cassandra process and/or node, breaking any streams. Not to mention the idea of a node in a decommissioned state continuing to stream seems like a bad idea. I believe it would work currently, but I'm not sure, and it seems likely to break.
I can't really think of any other examples, but I think that's enough to illustrate that overlapping concurrent ring changes are a bad idea and we should just attempt to prevent them in all cases. An argument could be made that this would prevent you from doubling your cluster (the best way to grow) all at once, but I don't think that's really a huge deal. At most you would need RF steps to double your cluster.
I think we can still allow overlapping concurrent ring changes, with the right set of invariants and/or operational rules. I've been trying to work on defining those. It's pretty tricky but I think I have the right way of approaching the model now. Will update later today with more.
Nick is right though, c* probably shouldn't support overlapping changes as is. Think bootstrapping >N nodes between the same two old nodes, decom'ing one node and bootstrapping another in such a way that the bootstrap source stream node becomes the wrong source, etc.
So the fact that the 'correct' bootstrap source switches mid stream isn't really the problem I don't think. We set up pending ranges so that when a node gets ready to bootstrap, writes start getting duplicated to both the currently correct replica, and the bootstrapping node. Since all new writes are duplicated we can stream from that node and as long as we get the entire dataset, consistency should be fine. The problem is there is nothing in place preventing someone from running cleanup or killing a decommed node or something.
I'm doubtful that the complexity of a correct set of rules for allowing overlapping ring changes is really worth the time/effort/fragility. It doesn't seem like that much of a loss to me to disallow them. Perhaps your set of rules will be super simple, though.
Right, I know how the pending-ranges writes work. But there are still possible openings for consistency violations for previously-written data, similar to the one outlined in the original ticket description. If the bootstrap stream source has an outdated value and it gets duplicated to a bootstrapping node, but then the bootstrap stream source doesn't leave the replication set (because something else changed in the interim), the outdated value is now on more nodes than it used to be: possibly now a quorum, when previously it was safely below.
Ok, prospective approach to totally safe range movements:
Operational rules:
- Cassandra will not allow two range motion operations (move, bootstrap, decom) at the same time on the same node.
- When a range motion operation is already pending, User should refrain from starting another range motion operation (if either motion operation overlaps the arc-of-effect of the other) until the gossip info about the first change has propagated to all affected nodes. (This is more simply approximated by the "two minute rule".)
- Every point in the tokenspace has the same number of natural endpoints, and they're ordered the same from the perspective of all nodes (is this an ok assumption?).
- It is User's responsibility to make sure that the right streaming source nodes are available. If they're not, the range motion operation may fail.
Procedure:
- For any motion involving range R, there will be a stream from endpoint EP_source to endpoint EP_dest. Given the same information about what range motion operations are pending (TokenMetadata) and the range R, there is a bijection from EP_source to EP_dest, shared by all nodes in the ring.
- Procedure to determine EP_source from EP_dest:
- Let REP_current be the existing (ordered) list of natural endpoints for R.
- Let TM_future be a clone of the current TokenMetadata, but with all ongoing bootstraps, moves, and decoms resolved and completed.
- Let REP_future be the list of (ordered) natural endpoints for R according to TM_future.
- Let EPL_entering be the list of endpoints in REP_future which are not in REP_current (preserving their order in REP_future).
- Let EPL_leaving be the list of endpoints in REP_current which are not in REP_future (preserving their order in REP_current).
- EPL_entering and EPL_leaving are of the same length.
- Let Pos be the position/index of EP_dest in EPL_entering.
- Let EP_source be the endpoint at position Pos in EPL_leaving.
- Intuitively, this is the same as the rule expressed earlier in this ticket (stream from the node you'll replace), but also handles other ongoing range movements in the same token arc.
- These rules can be pretty trivially inverted to determine EP_dest from EP_source.
- When any node gets gossip about a range motion occurring within its token arc-of-effect, it calculates (or recalculates) the streams in which it should be involved. Any ongoing streams which are no longer necessary are canceled, and any newly necessary streams are instigated.
I tried to construct a ruleset without that last rearrange-ongoing-streams rule, but it ended up with a pretty complicated set of extra restrictions, and a more complicated set of procedures than this.
This set of rules might look complicated, but I think it should be fairly straightforward to implement, and may even end up simpler overall than our current code.
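To pin the pairing down, here is a hypothetical sketch of the EP_source-from-EP_dest procedure above, with plain collections standing in for the real TokenMetadata machinery:
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the procedure above: given the ordered natural
// endpoints for range R now (repCurrent) and after all pending motions have
// resolved (repFuture), pair the i-th entering endpoint with the i-th leaving
// endpoint. Every node computes the same bijection from the same inputs.
final class SafeRangeMovementSketch
{
    static InetAddress sourceFor(InetAddress epDest, List<InetAddress> repCurrent, List<InetAddress> repFuture)
    {
        List<InetAddress> entering = new ArrayList<>(repFuture);   // EPL_entering, in REP_future order
        entering.removeAll(repCurrent);
        List<InetAddress> leaving = new ArrayList<>(repCurrent);   // EPL_leaving, in REP_current order
        leaving.removeAll(repFuture);

        assert entering.size() == leaving.size();
        int pos = entering.indexOf(epDest);
        if (pos < 0)
            throw new IllegalArgumentException(epDest + " is not gaining this range");
        return leaving.get(pos);                                   // EP_source
    }
}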
Note that this procedure even maintains the consistency guarantee in cases like:
- In an RF=3 cluster with nodes A, E, and F, bootstrap B, C, and D in quick succession (E streams to B, F streams to C, A streams to D)
- In an RF=3 cluster with nodes A, C, and E, bootstrap B, D, and F, and decommission A, C, and E, all in quick succession (A streams to B, C streams to D, E streams to F)
- In an RF=3 cluster with nodes A, B, C, D, and E, decommission B and C in quick succession (B streams to D, C streams to E)
Clarification: the cleanup operation would need to consider pending ranges in addition to the current natural range-endpoint mapping.
It would be possible with this proposal for a node M to be streaming range S to a bootstrapping node Y, and midway, for M to stop being part of the replication set for S (perhaps some other bootstraps nearby completed first). This should be ok for consistency, but a cleanup operation on M while the stream is ongoing could potentially remove all the data in S unless this change is made.
Further clarification: instead of the W+1 special case we now have for writing to a range T with a pending motion, we would need to write to W+C replicas, where C is the number of pending motions within the replication set of T.
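As a worked illustration of that rule (numbers of my own choosing): with RF=3 and QUORUM writes, W=2; one pending motion inside the replica set of T means a write must be acknowledged by W + C = 2 + 1 = 3 endpoints, and two pending motions would require 2 + 2 = 4.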
Coming back to this after the 1.0 scramble.
It sounds like you have a reasonable solution here, is there any reason not to implement it for 1.1?
It sounds like you have a reasonable solution here, is there any reason not to implement it for 1.1?
Just that it's quite a bit more complex than simply disallowing overlapping ring movements, and brings the extra problems that come with higher complexity. I think this feature is worth it on its own, but when I think of how much pain Brandon seems to be going through dealing with streaming code, maybe it's not.
No longer very optimistic on the "may even end up simpler overall than our current code" front?
TBH this area of the code is fragile and hairy and maybe starting from a clean slate with a real plan instead of trying to patch things in haphazardly would be a good thing.
But, I'd be okay with re-imposing the "no overlapping moves" rule and fixing the stream source problem if that's going to be substantially simpler.
No, I do think that if we tore out the existing code and replaced it, it would be simpler overall, but (a) that would probably also be true if we rewrote the existing code without implementing this; (b) it will be rather a lot of work; and (c) it may engender a whole new generation of subtle corner-case bugs (or maybe it will eliminate a lot of such bugs that already exist).
How much work would it be to add "just one more bandaid" for the stream source thing, in comparison?
Do you mean to implement the new "safe range movements" procedure outlined above, without rewriting the rest of the range movement code?
If so, I submit a SWAG of "1/3-1/2 the cost of the rewrite option". The bandaid would still touch a lot of different moving parts.
So either way, a substantial amount of work, but the bandaid still leaves us with rules that the operator must enforce or hit subtle problems.
My hang-up with the bandaid is the part where C* can't effectively enforce the guidelines to be safe. Even "wait X seconds between bootstrap operations" is not a prereq I am comfortable with.
Unless the above is incorrect, I think we should bite the bullet and fix it "right."
So, I believe that the rules outlined above can still work without the "wait X seconds between bootstrap operations" prereq, if a pretty simple extra step is added:
If any node learns about conflicting move operations, then some rules are applied to choose which will be honored and which will return an error to its caller (if still possible).
Those rules are:
- A decom for node X beats a move or bootstrap for node X
- Two decoms for node X from coordinator nodes Y and Z: the coordinator with the higher token wins
- Any other conflicts between move/bootstrap operations for the same node (which can arise in certain partition situations) are easily resolved by latest VersionedValue.
This should guarantee convergence of TokenMetadata across any affected parts of a cluster.
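A hypothetical sketch of that tie-break ordering (placeholder types, not real gossip classes), just to make the precedence explicit:
// Hypothetical sketch of the conflict rules above. Given two conflicting
// operations targeting the same node, return the one that should be honored.
final class MotionConflictSketch
{
    enum Kind { BOOTSTRAP, MOVE, DECOM }

    static Op resolve(Op a, Op b)
    {
        // Rule 1: a decom for node X beats a move or bootstrap for node X.
        if (a.kind == Kind.DECOM && b.kind != Kind.DECOM) return a;
        if (b.kind == Kind.DECOM && a.kind != Kind.DECOM) return b;
        // Rule 2: two decoms for node X -> the coordinator with the higher token wins.
        if (a.kind == Kind.DECOM && b.kind == Kind.DECOM)
            return a.coordinatorToken.compareTo(b.coordinatorToken) >= 0 ? a : b;
        // Rule 3: other move/bootstrap conflicts -> latest VersionedValue wins.
        return a.versionedValueGeneration >= b.versionedValueGeneration ? a : b;
    }

    static final class Op
    {
        final Kind kind;
        final String coordinatorToken;       // placeholder for a Token
        final int versionedValueGeneration;  // placeholder for VersionedValue recency

        Op(Kind kind, String coordinatorToken, int versionedValueGeneration)
        {
            this.kind = kind;
            this.coordinatorToken = coordinatorToken;
            this.versionedValueGeneration = versionedValueGeneration;
        }
    }
}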
Sounds reasonable.
As a half measure, we can stream from the "right" node very easily if we continue to make the simplifying assumption that no other node movement happens in overlapping ranges during the operation.
I've taken a crack at this, initially for 1.2 since it solves my pain. Appreciate a review.
As Jonathan Ellis mentions above, it requires only one node to be added at a time. Also, the bootstrapping node must add -Dconsistent.bootstrap=true.
#code
https://github.com/tjake/cassandra/tree/2434
#dtest showing it works (use ENABLE_VNODES=yes)
https://github.com/tjake/cassandra-dtest/tree/2434
(Tyler Hobbs to review)
Thanks, Jake.
I strongly prefer to default to the strict/safe behavior and make the user supply a "force" option for non-strict behavior, like Nick and Paul agreed on above. If the bootstrapping node cannot stream from the correct replica and the "force" option isn't set, it should abort the bootstrap with an error that describes the implications and mentions how to use the "force" option.
Additionally, I think your logic for picking the preferred replica could be greatly simplified. Paul's 2434-3.patch.txt has a really simple version of this and also has the strict-by-default behavior. It might be worthwhile to look at rebasing that patch as a start.
Paul mentioned this:
Conversation on #cassandra-dev resulted in the conclusion that we'll fix this bug for range acquisition (bootstrap and move) now, and plan to allow the same looseness (non-strict mode, or whatever) for range egress (move and decom) in the future.
Looking at the irc logs, there wasn't a strong reason for this. There's a lot of code overlap there, so it would be ideal to fix both types of operations at once. Do you think you could take a stab at that?
Additionally, I think your logic for picking the preferred replica could be greatly simplified. Paul's 2434-3.patch.txt has a really simple version of this and also has the strict-by-default behavior. It might be worthwhile to look at rebasing that patch as a start.
I did look at the patch and I'll see how I can simplify my version. Most of the complexity comes from multiple ranges living on the same address (vnodes). Which the old version didn't have to worry about.
I do think we can fix the other operations but those are less of a priority IMO and should be part of a follow up. (does anyone use move with vnodes?)
does anyone use move with vnodes?
No, but now they can use relocate (taketoken in nodetool)
Pushed an update to https://github.com/tjake/cassandra/tree/2434 that addresses the comments. I'm going to work on support for move/relocate.
Do we care about decommissions? It seems when we "push" data to other nodes there isn't anything to do. Only when we pick the replica to stream from does this ticket apply.
Do we care about decommissions? It seems when we "push" data to other nodes there isn't anything to do. Only when we pick the replica to stream from does this ticket apply.
I checked that StorageService.getChangedRangesForLeaving() pushes to the correct nodes (those that are gaining a range), so you're right, we don't need to do anything new for decom.
Updated branches with move/relocate support and added a dtest for move (in dtest branch linked above)
We should probably add a test for relocate too since it's fundamentally different from move.
I tested bootstrapping a node while the preferred replica was down. It turns out that CASSANDRA-6385 makes the bootstrapping node consider the replica up for long enough to pass the checks. It looks like we need to special case the 6385 behavior for bootstraps if we want this patch to work.
You can test this now by setting cassandra.fd_initial_value_ms.
Okay, with the workaround on the FD, bootstrap seems to work. Do we want to split that fix into a separate ticket?
However, relocate seems to be seriously broken. With a three node cluster and one of the nodes down, I can make relocate fail in a couple of ways:
- oldEndpoints == newEndpoints, so the assertion that the difference between them has length 1 fails
- There are no ranges that contain the "desiredRange", resulting in the IllegalStateException being thrown ("No sources found for " + toFetch);
With that said, nothing (including the tools) uses relocate. (EDIT: shuffle uses it, but nobody uses shuffle in practice due to other problems.) The JMX version doesn't work with jconsole, so I had to add a method to test this. I'm not even sure that relocate worked before this patch for vnodes, because there's only minimal test coverage for relocate. IMO, we shouldn't even try to modify this without good test coverage. But if nothing even uses relocate... I'm not sure what to do. Thoughts?
I will try working on a relocate test and look at addressing this
What version should this go into? I personally need this for 1.2 but I would put it in with the default of non-strict.
T Jake Luciani I think we want 2.1 with a default of strict.
Rebased to 2.1 branch and pushed to https://github.com/tjake/cassandra/tree/2434-2
Also added a relocation dtest https://github.com/tjake/cassandra-dtest/tree/2434
+1 overall, with a few minor nitpicks on the dtest:
You can replace that string-building loop with:
tl = " ".join(str(t) for t in tokens[0][:8])
There's also a leftover line: #assert 1 == 0
Don't forget to mark the new tests with a @since('2.1') decorator.
Committed the C* code, I'll push to dtests now.
This will be very nice to have in 2.0
cc Brandon Williams
Does seem like mostly new code we could just add with the flag defaulting to off to give 2.0 the option.
ISTM the easiest fix is to always stream from the node that will be removed from the replicas for each range, unless given permission from the operator to choose a replica that is closer / less dead.