CASSANDRA-4071: Topology changes can lead to bad counters (at RF=1)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      A counter is broken into shards (partitions), each shard being 'owned' by a given replica (meaning that only this replica will increment that shard). For a given node A, the resolution of two shards having the same owner follows these rules:

      • if the shards are owned by A, then sum the values (in the original patch, 'owned by A' was based on the machine IP address; in the current code it's based on the shard having a delta flag, but the principle is the same)
      • otherwise, keep the maximum value (based on the shards' clocks)

      During topology changes (bootstrap/move/decommission), we transfer data from A to B, but the shards owned by A are not owned by B (and we cannot make them owned by B, because during those operations (bootstrap, ...) a given shard would be owned by both A and B, which would break counters). But this means that B won't interpret the streamed shards correctly.

      Concretely, if A receives a number of counter increments that end up in different sstables (the shards should thus be summed) and those increments are then streamed to B as part of bootstrap, B will not sum the increments but will use the clocks to keep only the maximum value.
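      Below is a minimal, self-contained sketch of the resolution rule and of the failure mode above. It is illustrative only (the Shard record and merge method are simplifications, not the actual CounterContext code), but it shows why B ends up with 3 instead of 8 after the stream.

      // Illustrative sketch only -- assumes a simplified Shard model, not the real CounterContext.
      public class CounterMergeSketch {
          // A shard: (counterId, clock, value, delta). 'delta' means "this is a local
          // increment owned by this node and must be summed on merge".
          static final class Shard {
              final String counterId; final long clock; final long value; final boolean delta;
              Shard(String counterId, long clock, long value, boolean delta) {
                  this.counterId = counterId; this.clock = clock; this.value = value; this.delta = delta;
              }
          }

          // Resolution of two shards with the same counterId, per the rules above.
          static Shard merge(Shard a, Shard b) {
              if (a.delta && b.delta)            // both are local increments: sum the values
                  return new Shard(a.counterId, a.clock + b.clock, a.value + b.value, true);
              return a.clock >= b.clock ? a : b; // otherwise keep the shard with the highest clock
          }

          public static void main(String[] args) {
              // On node A: two increments (+5, +3) land in different sstables as delta shards.
              System.out.println(merge(new Shard("A", 1, 5, true), new Shard("A", 2, 3, true)).value);   // 8

              // Streamed to B during bootstrap: B does not own counterId "A", so the shards
              // arrive without the delta flag and B falls back to "keep the highest clock".
              System.out.println(merge(new Shard("A", 1, 5, false), new Shard("A", 2, 3, false)).value); // 3
          }
      }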

      I've pushed a test that shows the breakage at https://github.com/riptano/cassandra-dtest/commits/counters_test (the test needs CASSANDRA-4070 to work correctly).

      Note that in practice, replication will hide this (even though B will have the bad value after the bootstrap, a read or read repair from the other replica will repair it). This is a problem for RF=1, however.

      Another problem is that during repair, a node won't correctly repair other nodes on its own shards (unless everything is fully compacted).

        Activity

        jbellis Jonathan Ellis added a comment -

        fixed in CASSANDRA-6504

        simsky Kevin Ye added a comment -

        Is there any progress in fixing this bug? We rely heavily on counters; when a new node joins the cluster, we will be in trouble.

        slebresne Sylvain Lebresne added a comment -

        Could we introduce something like gc_g_s for node ids so renewing doesn't have to be avoided so religiously?

        The problem is that when we renew a counterId, the existing shards are still part of all those counters. So the only way to get rid of those shards would be to merge their values into some newer shard. But as far as I can tell (and I've put quite some thought into that), this is really hard to do correctly, because it's not something you can do in isolation from the other nodes.

        That being said, we do have some mechanism that does exactly that kind of shard merging in a few cases. I.e. if a node detects that a counter has 2 shards for 2 old counterIds of itself, it merges them in two phases: first we merge one of the shards into the other, but we keep the first shard with a value of 0. Only when we consider this has been propagated to all replicas (after gc_grace, basically) do we remove the zeroed shard (see the sketch after the list below). However:

        1. this shard merging is already a fairly dark and hacky corner of the counter implementation. I'd rather remove it than complicate it further.
        2. it basically only works for counters that don't change ownership. As soon as a counter changes ownership, whatever old counterIds it has won't ever be merged. The reason is that a node only knows about its own counterIds, because that's the only simple way I've found to ensure 2 nodes don't start doing the 2-phase merge on the same counterIds at the same time.
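        A rough sketch of that two-phase merge, with made-up names (this is not the actual CounterContext renewal/merge code, just an illustration of the phases described above):

        import java.util.ArrayList;
        import java.util.List;

        public class TwoPhaseShardMergeSketch {
            static final class Shard {
                final String counterId; long clock; long value;
                Shard(String id, long clock, long value) { this.counterId = id; this.clock = clock; this.value = value; }
            }

            // Phase 1: fold an old local shard into another one, but keep the old shard
            // around with value 0 and a bumped clock, so the zero wins over the stale
            // copies still present on other replicas.
            static void phase1(Shard oldShard, Shard target) {
                target.value += oldShard.value;
                target.clock += 1;
                oldShard.value = 0;
                oldShard.clock += 1;
            }

            // Phase 2: only once the zeroed shard is considered propagated to all
            // replicas (roughly gc_grace later) is it safe to drop it entirely.
            static void phase2(List<Shard> counter, Shard zeroedShard) {
                counter.remove(zeroedShard);
            }

            public static void main(String[] args) {
                Shard old1 = new Shard("old-id-1", 4, 7);
                Shard old2 = new Shard("old-id-2", 9, 3);
                List<Shard> counter = new ArrayList<>(List.of(old1, old2));
                phase1(old1, old2);      // counter now reads 10; old1 is kept, zeroed
                phase2(counter, old1);   // much later: the zeroed shard is removed
                System.out.println(counter.size() + " shard(s), total " + counter.get(0).value);
            }
        }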
        jbellis Jonathan Ellis added a comment -

        Could we introduce something like gc_g_s for node ids so renewing doesn't have to be avoided so religiously?

        slebresne Sylvain Lebresne added a comment -

        I'm beginning to think we've made a mistake in the counters design. Namely, when we write a new increment to the "leader", we write the increment, then read (which merges the new increment with the previous ones), then send that to the other replicas. But that's why we have all the delta business that is the root cause for this and for CASSANDRA-4417 (and we're not even sure we understand all the cases that can produce the error message of CASSANDRA-4417).

        An alternative would instead be that when the leader receives a new increment, it reads, applies the increment to the value read, and writes the result (a rough sketch follows the list below). If we do that, we don't need deltas anymore, fixing this issue as well as CASSANDRA-4417. We also don't ever have to renew nodeIds, so we also fix the problem of ever-growing counter contexts. And overall we greatly simplify the code. There would be clear performance downsides, however:

        1. we will have to lock during the read, apply increment, write result dance.
        2. we still read before the first write, so replicate_on_write won't be an option anymore (I've always been clear about how I personally feel about that option in the first place, so that would almost be an advantage in my opinion, but some disagree). It will also increase the latency of writes at CL.ONE.
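        A rough sketch of that alternative leader path, under the caveat that the store, the lock striping and the replication hook below are stand-ins for illustration, not Cassandra APIs:

        import java.util.concurrent.ConcurrentHashMap;

        public class LeaderIncrementSketch {
            private final ConcurrentHashMap<String, Long> store = new ConcurrentHashMap<>();
            private final Object[] locks = new Object[128];    // simple lock striping per counter
            { for (int i = 0; i < locks.length; i++) locks[i] = new Object(); }

            // Downside 1 above: the read / apply-increment / write-result dance needs a lock.
            public long applyIncrement(String counterKey, long delta) {
                Object lock = locks[Math.floorMod(counterKey.hashCode(), locks.length)];
                synchronized (lock) {
                    long current = store.getOrDefault(counterKey, 0L); // read before the write (downside 2)
                    long merged = current + delta;                     // apply the increment to the value read
                    store.put(counterKey, merged);                     // write the merged result locally
                    replicate(counterKey, merged);                     // ship a plain value, no deltas involved
                    return merged;
                }
            }

            private void replicate(String counterKey, long fullValue) {
                // stand-in: send the absolute value to the other replicas; because it is not
                // a delta, replaying or re-streaming it can never double-count.
            }

            public static void main(String[] args) {
                LeaderIncrementSketch leader = new LeaderIncrementSketch();
                leader.applyIncrement("page_views", 5);
                System.out.println(leader.applyIncrement("page_views", 3)); // 8
            }
        }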

        But even if we decide to go that route, another thing to take into account is that I don't know how to support the upgrade to this new way of doing things without requiring a major compaction on upgrade (which is particularly a problem for LeveledCompaction, because we don't even know how to major compact).

        So definitely not perfect, but the best idea I've had so far.

        slebresne Sylvain Lebresne added a comment -

        I'm starting to wonder if we shouldn't just create a separate streaming path for counters that would merge everything it has to stream on the fly and stream that (not unlike our old anti-compaction). That would pretty much be equivalent to my solution above of "doing a major compaction on the source before doing any streaming", but it would suck a little bit less (not by much, but I don't have a better solution so far). Of course, that would make streaming much less efficient for counter column families (on top of the pain of having two completely different paths for streaming).

        slebresne Sylvain Lebresne added a comment -

        We already do some form of node id regeneration on cleanup to avoid another type of problem. But the problem is, if we do something like what's proposed above, we would add back the problem that the current regeneration is trying to prevent. That problem is the following: say a node B is bootstrapped/moved and gets some ranges R from node A. Then suppose that shortly after that, B is decommissioned (for some reason), so that A gets R back. Then, if we don't remove the deltas during streaming and cleanup has not been run on A, we'll end up doubling the value of the shards.

        In other words, for a given shard, we must ensure that at most one node in the cluster has deltas on that shard. So either streaming doesn't transfer deltas (the current behavior), or we must ensure that when it does transfer them, it removes them locally.
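        To make the invariant concrete, here is a small illustration (made-up names, not the actual streaming code) of the two options: either the outgoing stream never carries the delta flag, or the source drops its own deltas locally once they have been handed over:

        public class DeltaOwnershipSketch {
            static final class Shard {
                final String counterId; final long clock; final long value; final boolean delta;
                Shard(String id, long clock, long value, boolean delta) {
                    this.counterId = id; this.clock = clock; this.value = value; this.delta = delta;
                }
            }

            // Option 1 (current behavior): strip the delta flag on the way out, so the
            // receiver only ever sees plain shards and can never double-count them.
            static Shard serializeForStream(Shard s) {
                return s.delta ? new Shard(s.counterId, s.clock, s.value, false) : s;
            }

            // Option 2 (transferring deltas): ownership moves with the stream, so the source
            // must clear its own delta flag locally right after (e.g. via cleanup); otherwise,
            // if A later gets range R back, the shard is counted twice.
            static Shard dropLocalDeltaAfterTransfer(Shard s) {
                return new Shard(s.counterId, s.clock, s.value, false);
            }

            public static void main(String[] args) {
                Shard local = new Shard("A", 3, 8, true);
                System.out.println(serializeForStream(local).delta);          // false: B receives a plain shard
                System.out.println(dropLocalDeltaAfterTransfer(local).delta); // false: A no longer owns it
            }
        }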

        That being said, I don't know which is the worse problem. Maybe it'd be better to start transferring deltas (pretty much Peter's suggestion) and work on making sure cleanup is run just after streaming, but it's not ideal in any case.

        I'll also note that regenerating node ids is not something we should do lightly, because each time it makes the counters bigger (and permanently so in a number of cases). Typically, if we wanted to fix the repair problem this way, we would have to regenerate the node id at the start of every repair, which is probably a cure worse than the disease.

        scode Peter Schuller added a comment -

        This is not entirely thought through, but: suppose, upon streaming, the source node regenerates its node id. Further suppose the target node is able to determine, for a given nodeid, whether it is "frozen". It could then opt not to scrub the delta flag on incoming sstables for shards with that nodeid. As far as I can tell, shards for a node id are safe to interpret as deltas if it is known that they will never ever have to be updated. Given that the source node has regenerated its node id, I think this is the case?

        It feels like there are issues with this, but it's just a thought. Potential concerns I can think of:

        • Will read repair generate new shards for an old nodeid? I don't think so.
        • If old shards get removed on the new owner (the stream destination) before the old owner stops being responsible for the data, could that cause a problem?

        I am not really suggesting this be done; it seems too complex/fragile to me. But it's worth mentioning.

        slebresne Sylvain Lebresne added a comment -

        I have no clue how to fix this in a reasonable way. One way to fix it would be to do a major compaction on the source before doing any streaming (at least of what we plan to stream), which doesn't sound very fun.


          People

          • Assignee:
            Unassigned
            Reporter:
            slebresne Sylvain Lebresne
          • Votes:
            1
            Watchers:
            15
