Cassandra > CASSANDRA-3832

gossip stage backed up due to migration manager future de-ref

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels: None

      Description

      This is just bootstrapping a ~180 node trunk cluster. After a while, a
      node I was on was stuck thinking all nodes were down, because the
      gossip stage was backed up, because it was spending a long time
      (multiple seconds or more, I suppose the RPC timeout maybe) doing the
      following. Cluster-wide restart -> back to normal. I have not
      investigated further.

      "GossipStage:1" daemon prio=10 tid=0x00007f9d5847a800 nid=0xa6fc waiting on condition [0x000000004345f000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000005029ad1c0> (a java.util.concurrent.FutureTask$Sync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
      	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
      	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
      	at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:364)
      	at org.apache.cassandra.service.MigrationManager.rectifySchema(MigrationManager.java:132)
      	at org.apache.cassandra.service.MigrationManager.onAlive(MigrationManager.java:75)
      	at org.apache.cassandra.gms.Gossiper.markAlive(Gossiper.java:802)
      	at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:918)
      	at org.apache.cassandra.gms.GossipDigestAckVerbHandler.doVerb(GossipDigestAckVerbHandler.java:68)
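
      For illustration only (this is not the actual Cassandra code, and it uses modern JDK executors and lambdas for brevity), here is a minimal sketch of why a single blocking de-ref on the gossip stage queues every other gossip task behind it:

      import java.util.concurrent.*;

      public class GossipStageBackupSketch
      {
          // Stand-ins for the single-threaded GossipStage and MigrationStage executors.
          static final ExecutorService gossipStage = Executors.newSingleThreadExecutor();
          static final ExecutorService migrationStage = Executors.newSingleThreadExecutor();

          public static void main(String[] args)
          {
              // onAlive() runs on the gossip stage and blocks on a future whose task
              // may take seconds (e.g. up to an RPC timeout) to complete.
              gossipStage.submit(() -> {
                  Future<?> schemaPull = migrationStage.submit(() -> sleepQuietly(5000)); // pretend remote wait
                  try
                  {
                      schemaPull.get(); // the de-ref that stalls the gossip stage
                  }
                  catch (InterruptedException | ExecutionException ignored) {}
              });

              // Every later gossip message queues behind the blocked task, so the
              // node starts treating its peers as down.
              for (int i = 0; i < 3; i++)
              {
                  final int n = i;
                  gossipStage.submit(() -> System.out.println("gossip digest " + n + " finally processed"));
              }

              gossipStage.shutdown();
              migrationStage.shutdown();
          }

          static void sleepQuietly(long millis)
          {
              try { Thread.sleep(millis); } catch (InterruptedException ignored) {}
          }
      }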
      

        Issue Links

          Activity

          scode Peter Schuller added a comment -

          This happens w/o bootstrap too. I just took the same ~ 180 node cluster on 0.8 and deployed trunk on it. All nodes restarted almost at the same time, but no one is in any state but Normal. It's sitting there not really coming up (probably is, but slowly) despite this just being a full cluster restart effectively.

          scode Peter Schuller added a comment -

          Meanwhile, MigrationStage is stuck like this:

          "MigrationStage:1" daemon prio=10 tid=0x00007fb5b450e800 nid=0x3395 waiting on condition [0x0000000043479000]
             java.lang.Thread.State: TIMED_WAITING (parking)
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <0x00000005032ed688> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
          	at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:61)
          	at org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:119)
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
          	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          

          The GossipStage submits the job to the migration stage on the local node and waits for the result. The migration stage in turn sends a message and waits for the response synchronously.

          The migration request runs on the migration stage on the remote node, which is presumably stuck with its own task on the migration stage.

          In effect, we are causing a distributed deadlock (or almost deadlock, I'm not sure - I suppose we might get unstuck eventually since things do time out after rpc timeout).
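
          To make the cycle concrete, a toy model (hypothetical names and modern JDK classes, not the real messaging code): each node's single migration thread is parked waiting on a reply that only the other node's equally parked migration thread could produce.

          import java.util.concurrent.*;

          public class MigrationDeadlockSketch
          {
              // Two single-threaded "migration stages", one per node.
              static final ExecutorService nodeA = Executors.newSingleThreadExecutor();
              static final ExecutorService nodeB = Executors.newSingleThreadExecutor();

              public static void main(String[] args)
              {
                  CompletableFuture<Void> replyFromB = new CompletableFuture<>();
                  CompletableFuture<Void> replyFromA = new CompletableFuture<>();

                  // Each node's migration task sends a request to the other node and
                  // blocks until that node's migration stage answers it.
                  nodeA.submit(() -> awaitOrTimeout(replyFromB, "A")); // A's only migration thread is now parked
                  nodeB.submit(() -> awaitOrTimeout(replyFromA, "B")); // ...and so is B's

                  // Neither node can serve the other's request until its own wait times
                  // out: the "deadlock until rpc timeout" described above.
                  nodeA.shutdown();
                  nodeB.shutdown();
              }

              static void awaitOrTimeout(CompletableFuture<Void> reply, String node)
              {
                  try
                  {
                      reply.get(2, TimeUnit.SECONDS); // stands in for rpc_timeout
                  }
                  catch (TimeoutException e)
                  {
                      System.out.println("node " + node + " gave up waiting after the timeout");
                  }
                  catch (InterruptedException | ExecutionException ignored) {}
              }
          }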

          scode Peter Schuller added a comment -

          Though in this case, for as long as I was looking at it, the cluster was not showing signs of recovering. Essentially, the cluster is in a "fallen and can't get up" state.

          jbellis Jonathan Ellis added a comment -

          It looks like the intent of the wait is to avoid sending more than one request to the same target for the same version. One possible fix would be to replace the wait with a Set of in-progress requests a la HHOM.
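
          Roughly what that could look like, as a sketch only (it assumes a simple (endpoint, version) key and uses modern JDK collections; it is not the committed code):

          import java.net.InetAddress;
          import java.util.Objects;
          import java.util.Set;
          import java.util.concurrent.ConcurrentHashMap;

          // Sketch: de-duplicate in-flight schema pull requests per (endpoint, version)
          // instead of blocking until the previous request for that pair completes.
          public class SchemaPullDeduper
          {
              private static final class RequestKey
              {
                  final InetAddress endpoint;
                  final String schemaVersion;

                  RequestKey(InetAddress endpoint, String schemaVersion)
                  {
                      this.endpoint = endpoint;
                      this.schemaVersion = schemaVersion;
                  }

                  @Override public boolean equals(Object o)
                  {
                      if (!(o instanceof RequestKey)) return false;
                      RequestKey k = (RequestKey) o;
                      return endpoint.equals(k.endpoint) && schemaVersion.equals(k.schemaVersion);
                  }

                  @Override public int hashCode()
                  {
                      return Objects.hash(endpoint, schemaVersion);
                  }
              }

              private final Set<RequestKey> inFlight = ConcurrentHashMap.newKeySet();

              /** Returns true if the caller should actually send a request for this pair. */
              public boolean tryStart(InetAddress endpoint, String schemaVersion)
              {
                  return inFlight.add(new RequestKey(endpoint, schemaVersion));
              }

              /** Call from the response (or timeout) path so the pair can be retried later. */
              public void finish(InetAddress endpoint, String schemaVersion)
              {
                  inFlight.remove(new RequestKey(endpoint, schemaVersion));
              }
          }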

          scode Peter Schuller added a comment -

          Attaching a simple patch to just not wait on the future. Given that we have no special code path to handle timeouts anyway, this does not remove any failure handling beyond what is already missing, so as far as I can tell it should not cause any failure to reach schema agreement that we were not already vulnerable to.

          Also upping priority since this bug causes clusters to refuse to start up even with full cluster re-starts by the operator.
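
          The shape of the change, paraphrased with hypothetical names rather than quoting the actual patch: submit the pull task to the migration stage and return, instead of de-referencing the returned future on the calling thread.

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.Future;

          // Paraphrase of the fix (hypothetical names): stop de-referencing the future on
          // the calling (gossip) thread and let the migration stage do its work asynchronously.
          public class DontWaitOnFutureSketch
          {
              private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();

              // Before: the caller blocked until the schema pull finished.
              void rectifySchemaBlocking(Runnable pullTask) throws Exception
              {
                  Future<?> f = migrationStage.submit(pullTask);
                  f.get(); // this is the de-ref that backed up the gossip stage
              }

              // After: submit and return; there was no special timeout handling on the
              // caller's side anyway, so nothing is lost by not waiting.
              void rectifySchemaAsync(Runnable pullTask)
              {
                  migrationStage.submit(pullTask);
              }
          }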

          scode Peter Schuller added a comment -

          (Sorry about the e-mail spam I just generated; there are race conditions in the JIRA UI and the 'submit patch' button exchanges places with start/stop progress...)

          scode Peter Schuller added a comment -

          jbellis: Ok, that seems plausible (about the intent). I wonder if it is worth it though? What are the negative consequences of multiple requests beyond the CPU usage?

          scode Peter Schuller added a comment -

          I will submit a patch that does it and you can decide.

          jbellis Jonathan Ellis added a comment -

          committed

          scode Peter Schuller added a comment -

          A concern here, though, is that to make this scale on very large clusters, you probably want to limit the number of schema migration attempts in progress for a given schema version, and not just the number outstanding for a single node.

          But doing that requires complicating the code so that we don't fail to migrate to a new schema just because one node happened to go down right after we were notified that, e.g., 500 nodes are alive with a schema we don't recognize.

          For now, I will proceed with avoiding duplicate (endpoint, schema) pairs rather than global throttling.

          scode Peter Schuller added a comment -

          My last comment was made before reading "committed", sorry.

          scode Peter Schuller added a comment -

          If you still want the augmented patch, let me know.

          jbellis Jonathan Ellis added a comment -

          I'm happy to wait and see if it's actually a problem, before fixing it.

          scode Peter Schuller added a comment -

          Something is still fishy. A node that I did a decommission+bootstrap on is not finishing bootstrap and is perpetually claiming to be waiting for schema, with the migration stage backed up like below. Investigating.

          "MigrationStage:1" daemon prio=10 tid=0x00007f63ac0bb000 nid=0xc8a waiting on condition [0x000000004355b000]
             java.lang.Thread.State: TIMED_WAITING (parking)
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <0x0000000508776fa8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
          	at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:61)
          	at org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:122)
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
          	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          	at java.lang.Thread.run(Thread.java:662)
          
          scode Peter Schuller added a comment -

          And the test is exactly whether anything is active in the migration stage:

              // Bootstrap only proceeds once nothing is running on the migration stage.
              public static boolean isReadyForBootstrap()
              {
                  return StageManager.getStage(Stage.MIGRATION).getActiveCount() == 0;
              }
          
          scode Peter Schuller added a comment -

          It eventually got unstuck, but it took quite some time. I'm not sure why it was having to wait for those responses.

          scode Peter Schuller added a comment -

          I added logging to see which node it's waiting on a response from, and quickly logged into that node to catch it red-handed - it was sitting in the exact same place in the migration manager on the migration stage:

          "MigrationStage:1" daemon prio=10 tid=0x00007f18ec4dc800 nid=0x1d64 waiting on condition [0x0000000043391000]
             java.lang.Thread.State: TIMED_WAITING (parking)
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for  <0x000000050157fdd0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2116)
          	at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:61)
          	at org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:124)
          	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
          	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          	at java.lang.Thread.run(Thread.java:662)
          

          I guess we're triggering distributed deadlock "internally" within the migration stage, even though we fixed it so that the gossip stage wouldn't be backed up. If my understanding is correct, this is because all nodes, when a node is marked alive, only know that it has a different schema - not who has the "newer" schema. So when a node joins, it gets migration messages from others while it also tries to send migration messages to others and waits for the responses. Whenever it sends a migration message to someone whose migration stage is busy waiting on a response from the node in question - deadlock (until timeout).

          scode Peter Schuller added a comment -

          What I'd like in this case is to send a request and register a callback for the response that just puts a task on the migration stage to do the schema merge. That, and keep track of outstanding requests and alter the logic for "is ready to bootstrap".

          I'm still fuzzy on the overall design of messaging; is it me, or is there no way to send a request and wait for a response without hogging a thread for it? I want an on-delivery callback, not to be waiting on a synchronous get(). A hacky work-around would be to just keep track of all outstanding requests and get() them in order - it would work, but it would be a bit unclean to have one slow request block others from being de-ref'ed.
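
          Modelled with CompletableFuture rather than Cassandra's actual messaging internals (all names here are hypothetical), the callback approach could look something like this, with bootstrap readiness tracking outstanding pulls instead of active migration-stage threads:

          import java.util.concurrent.CompletableFuture;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.atomic.AtomicInteger;

          // Sketch of a non-blocking schema pull: register a callback for the response and
          // have the callback enqueue the merge on the migration stage, so no migration
          // thread is parked waiting on a remote node.
          public class AsyncSchemaPullSketch
          {
              private final ExecutorService migrationStage = Executors.newSingleThreadExecutor();
              private final AtomicInteger outstandingRequests = new AtomicInteger();

              // remoteReply stands in for "send a migration request and get the reply asynchronously".
              void pullSchemaFrom(CompletableFuture<byte[]> remoteReply)
              {
                  outstandingRequests.incrementAndGet();
                  remoteReply.whenComplete((serializedSchema, error) -> {
                      outstandingRequests.decrementAndGet();
                      if (error == null)
                          migrationStage.submit(() -> mergeSchema(serializedSchema)); // merge off the callback thread
                  });
              }

              // Bootstrap readiness would then also consider outstanding pulls, not just
              // whether a migration-stage thread happens to be active right now.
              boolean isReadyForBootstrap()
              {
                  return outstandingRequests.get() == 0;
              }

              private void mergeSchema(byte[] serializedSchema)
              {
                  // apply the remote schema locally (elided in this sketch)
              }
          }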

          slebresne Sylvain Lebresne added a comment -

          So if I read this correctly, the initial patch did fix the initial issue of bootstrap being stuck. I understand it's still not perfect, but for the sake of keeping track of changes (in particular what went into 1.1.0), I'm closing this one. Let's fix the remaining problems in a separate ticket.

          scode Peter Schuller added a comment -

          The remaining issue is still causing problems for bootstrap, though not quite as severe as the original problem. Follow-up work filed in CASSANDRA-3882.


            People

            • Assignee:
              scode Peter Schuller
              Reporter:
              scode Peter Schuller
              Reviewer:
              Jonathan Ellis
            • Votes:
              0
              Watchers:
              1
