Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.2.0 beta 2
    • Component/s: Core
    • Labels: None

      Description

      As originally designed, bootstrapping nodes should always receive all writes under any consistency level, so that when bootstrap finishes the operator can run cleanup on the old nodes without fear of losing data.

      But if a bootstrap operation fails or is aborted, all writes will fail until the ex-bootstrapping node is decommissioned. So, as of CASSANDRA-722, we just ignore dead nodes in consistency level calculations.

      But this breaks the original design. CASSANDRA-822 adds a partial fix (simply adding the bootstrap targets to the RF targets and hinting normally), but this is still broken under certain conditions. The real fix is to check the consistency level against two sets of nodes:

      1. the RF targets as currently existing (no pending ranges)
      2. the RF targets as they will exist after all movement ops are done

      If we satisfy CL for both sets then we will always be in good shape.

      I'm not sure we can easily calculate (2) from the current TokenMetadata, though.
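
      A minimal sketch of this rule, with hypothetical names (an illustration of the idea, not the actual patch): a write counts as successful only if the acknowledged nodes satisfy the required count within both sets.

          // Illustrative sketch only: check the consistency level against both
          // the current replicas (set 1) and the post-movement replicas (set 2).
          import java.util.HashSet;
          import java.util.Set;

          class TwoSetConsistencyCheck
          {
              static boolean satisfiesBothSets(Set<String> acked,
                                               Set<String> currentTargets,
                                               Set<String> futureTargets,
                                               int required)
              {
                  return countIn(acked, currentTargets) >= required
                      && countIn(acked, futureTargets) >= required;
              }

              // How many acknowledged nodes fall inside the given target set.
              private static int countIn(Set<String> acked, Set<String> targets)
              {
                  Set<String> hit = new HashSet<String>(acked);
                  hit.retainAll(targets);
                  return hit.size();
              }
          }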

      Attachments

      1. 0001-Increase-CL-with-boostrapping-leaving-node.patch
        36 kB
        Sylvain Lebresne
      2. 833-v2.txt
        34 kB
        Jonathan Ellis

        Activity

        Jonathan Ellis added a comment -

        committed w/ that fix
        Sylvain Lebresne added a comment -

        In counterWriteTask, sendToHintedEndpoints should be called with remotes, not targets.

        But other than that it lgtm. I don't remember why it was reverted either, and I don't recall any specific problem with it. But in any case, you reverted it almost right away, so if that wasn't accidental the regression was likely easy to spot, so we'll see soon enough.
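
        As a schematic sketch of that point (hypothetical names, not the real StorageProxy signatures): the local leader applies the counter mutation itself, so only the remaining replicas should go through the hinted-endpoint path.

            import java.util.HashSet;
            import java.util.Set;

            // Schematic illustration only; not the actual counterWriteTask code.
            class CounterWriteSketch
            {
                static void counterWriteTask(String mutation, Set<String> targets, String localNode)
                {
                    applyLocally(mutation);                   // the leader's own write

                    Set<String> remotes = new HashSet<String>(targets);
                    remotes.remove(localNode);                // exclude the leader

                    sendToHintedEndpoints(mutation, remotes); // 'remotes', not 'targets'
                }

                static void applyLocally(String mutation) { /* local write path */ }
                static void sendToHintedEndpoints(String mutation, Set<String> remotes) { /* remote + hint path */ }
            }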

        Jonathan Ellis added a comment -

        (Not exactly a pure rebase since I split out pendingRangesFor instead of cramming everything into getWriteEndpoints.)

        Jonathan Ellis added a comment -

        Pushed rebase to https://github.com/jbellis/cassandra/branches/833-3
        Jonathan Ellis added a comment -

        Well, WTF.

        commit 063c8f6cf7b12e976b0d7067037c52c548c6c0db
        Author: Jonathan Ellis <jbellis@apache.org>
        Date:   Thu Jun 9 00:16:27 2011 +0000
        
            revert 1133443
            
            git-svn-id: https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.8@1133610 13f79535-47bb-0310-9956-ffa450edef68
        
        commit 31f0ee95e927c09183dca77be7739305ba2eeab0
        Author: Jonathan Ellis <jbellis@apache.org>
        Date:   Wed Jun 8 15:45:54 2011 +0000
        
            fix inconsistency window during bootstrap
            patch by slebresne; reviewed by jbellis for CASSANDRA-833
            
            git-svn-id: https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.8@1133443 13f79535-47bb-0310-9956-ffa450edef68
        

        I have no memory of this.

        Maybe it caused a regression?

        Hudson added a comment -

        Integrated in Cassandra-0.8 #158 (See https://builds.apache.org/job/Cassandra-0.8/158/)
        fix inconsistency window during bootstrap
        patch by slebresne; reviewed by jbellis for CASSANDRA-833

        jbellis: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1133443
        Files:

        • /cassandra/branches/cassandra-0.8/CHANGES.txt
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/DatacenterSyncWriteResponseHandler.java
        • /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/locator/SimpleStrategyTest.java
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/StorageProxy.java
        • /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/service/LeaveAndBootstrapTest.java
        • /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/service/MoveTest.java
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/locator/AbstractReplicationStrategy.java
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/AbstractWriteResponseHandler.java
        • /cassandra/branches/cassandra-0.8/test/unit/org/apache/cassandra/service/ConsistencyLevelTest.java
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/WriteResponseHandler.java
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/service/DatacenterWriteResponseHandler.java
        • /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/locator/TokenMetadata.java
        Jonathan Ellis added a comment -

        committed
        Sylvain Lebresne added a comment -

        +1
        Jonathan Ellis added a comment -

        v2 tweaks getWriteEndpoints to avoid new Collection creation where possible, instead using Iterables.concat.

        otherwise lgtm.
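
        For illustration, a minimal sketch of that approach (hypothetical method, not the committed getWriteEndpoints):

            import java.util.List;

            import com.google.common.collect.Iterables;

            class WriteEndpointsSketch
            {
                // Returns a lazy view over natural + pending endpoints instead of
                // copying both into a newly allocated collection on every write.
                static Iterable<String> getWriteEndpoints(List<String> natural, List<String> pending)
                {
                    if (pending.isEmpty())
                        return natural; // common case: no movement ops, no allocation
                    return Iterables.concat(natural, pending);
                }
            }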

        Sylvain Lebresne added a comment -

        Attaching a patch that implements the "simplification" idea. The case of {LOCAL|EACH}_QUORUM requires some care, but I think that by considering each DC separately we are fine.
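
        A rough sketch of the per-DC accounting (illustrative names only, not the patch): for LOCAL_QUORUM/EACH_QUORUM, the extra ack per pending replica has to be counted within each datacenter independently.

            import java.util.HashMap;
            import java.util.Map;

            class PerDcBlockForSketch
            {
                // For each DC: that DC's quorum plus one extra ack per pending
                // (bootstrapping/leaving) replica in that DC.
                static Map<String, Integer> blockForPerDc(Map<String, Integer> replicasPerDc,
                                                          Map<String, Integer> pendingPerDc)
                {
                    Map<String, Integer> blockFor = new HashMap<String, Integer>();
                    for (Map.Entry<String, Integer> e : replicasPerDc.entrySet())
                    {
                        int quorum = e.getValue() / 2 + 1;
                        Integer pending = pendingPerDc.get(e.getKey());
                        blockFor.put(e.getKey(), quorum + (pending == null ? 0 : pending));
                    }
                    return blockFor;
                }
            }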

        Jonathan Ellis added a comment -

        Consider the case of CL=1, RF=3 to replicas A, B, C. We begin bootstrapping node D, and write a row K to the range being moved from C to D.

        If the cluster is heavily loaded, it's possible that we write one copy to C, all the other writes get dropped, and once bootstrap completes we lose the row. Or if we write one copy to D, and cancel bootstrap, we again lose the row.

        As said above, we want to satisfy CL for both the pre- and post-bootstrap nodes (in case bootstrap aborts). This requires treating the old/new range owner as a unit: both D and C need to accept the write for it to count towards CL. So rather than considering {A, B, C, D}, we should consider {A, B, (C, D)}.

        This is a lot of complexity to introduce. A simplification that preserves correctness is to continue treating nodes independently but require one more node than normal CL. So CL=1 would actually require 2 nodes; CL=Q would require 3 (for RF=3), and so forth. (Note that Q(3) + 1 is the same as Q(4), which is what the existing code computes; that is one reason I chose a CL=1 example to start with, since those are not the same even for the simple case of RF=3.)

        This would mean we may fail a few writes unnecessarily (a write to A or B is actually sufficient to satisfy CL=1, but this scheme would time that out) but never allow a write to succeed that would leave CL unsatisfied post-bootstrap (or if bootstrap is cancelled).
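
        The arithmetic, as a sketch (hypothetical helpers, not the patch itself):

            class BlockForSketch
            {
                static int quorum(int rf)
                {
                    return rf / 2 + 1; // e.g. quorum(3) == 2, quorum(4) == 3
                }

                // Simplification: the normal CL count plus one ack per pending replica.
                static int blockFor(int clCount, int pendingReplicas)
                {
                    return clCount + pendingReplicas;
                }

                public static void main(String[] args)
                {
                    // CL=ONE, RF=3, one bootstrapping node: require 2 acks, not 1.
                    System.out.println(blockFor(1, 1));         // 2
                    // CL=QUORUM, RF=3, one pending node: 2 + 1 = 3, i.e. quorum(4).
                    System.out.println(blockFor(quorum(3), 1)); // 3
                }
            }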

        Jaakko Laine added a comment -

        This issue is not only related to bootstrapping, since nodes leaving the ring will also cause pending ranges. If a node does not complete the leaving operation properly, obsolete pending ranges will be left in the metadata.

        (2) above is actually almost exactly how pending ranges are calculated: all current move operations are assumed finished, and pending ranges are calculated according to the new natural endpoints for the ranges in question.
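
        As a sketch of that rule (hypothetical names, not TokenMetadata's actual API): the pending endpoints for a range are the natural endpoints once all in-flight moves finish, minus the natural endpoints now.

            import java.util.HashSet;
            import java.util.Set;

            class PendingRangesSketch
            {
                // Endpoints that only become owners after the in-flight
                // bootstrap/leave/move operations complete are "pending".
                static Set<String> pendingEndpoints(Set<String> naturalNow, Set<String> naturalAfterMoves)
                {
                    Set<String> pending = new HashSet<String>(naturalAfterMoves);
                    pending.removeAll(naturalNow);
                    return pending;
                }
            }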

        This is not directly related to bootstrapping IMHO, but to the fact that node movement increases the quorum, and since node movement is uncertain, there is a bigger chance that something goes wrong and the quorum nodes cannot be reached.

        Jonathan Ellis added a comment -

        To clarify: the #722 fix breaks the design because a bootstrapping node that goes "down" temporarily but completes bootstrap will not actually have all the writes that happened during bootstrap on it.

          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Jonathan Ellis
            Reviewer:
            Sylvain Lebresne
          • Votes:
            0
            Watchers:
            4
