Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Fix Version/s: 0.7.1
    • Component/s: None
    • Labels: None

      Description

      Occasionally when decommissioning a node, a race condition occurs where another node never removes the token and thus propagates it again with a state of down. With CASSANDRA-1900 we can solve this, but it shouldn't occur in the first place.

      Given nodes A, B, and C, if you decommission B it will stream to A and C. When streaming completes, B will decommission and receive this stack trace:

      ERROR 00:02:40,282 Fatal exception in thread Thread[Thread-5,5,main]
      java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
      at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:62)
      at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
      at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
      at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:387)
      at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91)

      At this point A will show it is removing B's token, but C will not; instead, C's failure detector will report that B is dead, and nodetool ring on C shows B in a leaving/down state. In another gossip round, C will propagate this state back to A.

        Activity

        Brandon Williams added a comment -

        Here is what is happening:

        B sends LEFT to C, C calls removeEndpoint and drops the endpoint state. B never gets to send to A (because it only waits 2s to announce, which can be just one round) and A still thinks it's LEAVING. C sees B in a gossip digest from A, and not knowing anything about it, calls requestAll, but A refuses to tell C anything about it because A has B in justRemovedEndpoints. Eventually, QUARANTINE_DELAY expires and A unhelpfully propagates the LEAVING state back to C.
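
        A simplified, self-contained model of the pre-fix behavior described above; EndpointModel, stateFor and QUARANTINE_DELAY_MS are illustrative stand-ins rather than the actual Cassandra classes:

        import java.net.InetAddress;
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        class EndpointModel
        {
            // live endpoint state, keyed by endpoint address
            final Map<InetAddress, String> endpointStates = new ConcurrentHashMap<>();
            // endpoints we recently removed, mapped to the time of removal
            final Map<InetAddress, Long> justRemovedEndpoints = new ConcurrentHashMap<>();

            static final long QUARANTINE_DELAY_MS = 60_000; // placeholder value

            // Pre-fix behavior: drop the state immediately and quarantine the endpoint.
            void removeEndpoint(InetAddress ep)
            {
                endpointStates.remove(ep); // the LEFT state is discarded right here
                justRemovedEndpoints.put(ep, System.currentTimeMillis());
            }

            // When another node asks about ep (C's requestAll), there is nothing left
            // to answer with while ep is quarantined, so C never learns about LEFT.
            String stateFor(InetAddress ep)
            {
                if (justRemovedEndpoints.containsKey(ep))
                    return null; // refuse to share anything during quarantine
                return endpointStates.get(ep);
            }
        }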

        The obvious solution here is that B should announce LEFT for RING_DELAY, simply because it's the right thing to do as opposed to a one-off delay of 2 seconds.
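
        A minimal sketch of that first change, assuming a Gossiper-like shutdown path; announceLeft, stopGossiping, shutdownMessaging and RING_DELAY_MS are illustrative names, not verbatim Cassandra identifiers:

            // Decommission shutdown: announce LEFT and keep gossiping it for a full
            // RING_DELAY instead of a one-off 2-second sleep, so every node has a
            // realistic chance to hear it directly from the leaving node.
            void announceAndShutdown() throws InterruptedException
            {
                announceLeft();              // add the LEFT state for our token
                Thread.sleep(RING_DELAY_MS); // was: a fixed ~2s announce window
                stopGossiping();
                shutdownMessaging();
            }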

        However, this exposes a more subtle problem. When removeEndpoint is called, we drop the state right away and track the endpoint in justRemovedEndpoints. Instead, we should hold on to the state so it is still propagated in further gossip digests, and expire it when we expire justRemovedEndpoints.
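
        Under the same simplified model as the earlier sketch, the second change would look roughly like this; expireQuarantined is an illustrative name for whatever periodic status check ages out quarantine entries:

            // Post-fix removeEndpoint: quarantine the endpoint but keep its state,
            // so later gossip digests still carry LEFT to nodes that missed it.
            void removeEndpoint(InetAddress ep)
            {
                // note: endpointStates.remove(ep) is intentionally NOT called here
                justRemovedEndpoints.put(ep, System.currentTimeMillis());
            }

            // Called periodically; the state is dropped only when quarantine expires.
            void expireQuarantined()
            {
                long now = System.currentTimeMillis();
                justRemovedEndpoints.entrySet().removeIf(entry -> {
                    if (now - entry.getValue() < QUARANTINE_DELAY_MS)
                        return false;
                    endpointStates.remove(entry.getKey()); // expire the state here instead
                    return true;
                });
            }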

        Either of these changes is technically enough to solve this issue, but both together add an extra safeguard. Changing where we expire the endpoint state is the more impactful of the two; however, the gossip generation and version checks always prevent any negative consequences from doing this.
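
        The safety argument relies on the usual gossip ordering rule, sketched here under the assumption that each piece of endpoint state carries a (generation, version) pair:

            // A stale LEFT/LEAVING state that lingers locally can never overwrite
            // newer remote state: the generation breaks ties across restarts and the
            // version breaks ties within a generation.
            boolean remoteIsNewer(int localGen, int localVer, int remoteGen, int remoteVer)
            {
                if (remoteGen != localGen)
                    return remoteGen > localGen;
                return remoteVer > localVer;
            }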

        Gary Dusbabek added a comment -

        +1

        Brandon Williams added a comment -

        Committed.

        Hudson added a comment -

        Integrated in Cassandra-0.7 #244 (See https://hudson.apache.org/hudson/job/Cassandra-0.7/244/)
        Fix race condition during decommission by announcing for RING_DELAY and
        not removing endpoint state until removing the ep from
        justRemovedEndpoints.
        Patch by brandonwilliams, reviewed by gdusbabek for CASSANDRA-2072


          People

          • Assignee:
            Brandon Williams
          • Reporter:
            Brandon Williams
          • Reviewer:
            Gary Dusbabek
          • Votes:
            0
          • Watchers:
            0
