CASSANDRA-19879

distributed.test.ring.BootstrapTest#bootstrapUnspecifiedResumeTest fails sometimes

Details

    • Low
    • All
    • None

    Description

      The JUnit test org.apache.cassandra.distributed.test.ring.BootstrapTest#bootstrapUnspecifiedResumeTest fails intermittently with an NPE:

       java.lang.NullPointerException: Cannot invoke "org.apache.cassandra.gms.EndpointState.getApplicationState(org.apache.cassandra.gms.ApplicationState)" because "state" is null
      
      	at org.apache.cassandra.distributed.action.GossipHelper$PullSchemaFrom.lambda$accept$6adea493$1(GossipHelper.java:245)
      	at org.apache.cassandra.distributed.impl.IsolatedExecutor.lambda$async$10(IsolatedExecutor.java:156)
      	at org.apache.cassandra.concurrent.FutureTask$2.call(FutureTask.java:124)
      	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
      	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      	at java.base/java.lang.Thread.run(Thread.java:840)

      Observed while testing CASSANDRA-19651.
      It is not easy to reproduce.
      As part of Instance.startup, org.apache.cassandra.gms.Gossiper#waitToSettle waits 5 + 3 x 1 = 8 seconds when the number of nodes discovered via gossip does not change (even if there has been no gossip interaction with other nodes at all).
      To widen the window, I added a 5-second sleep to org.apache.cassandra.gms.Gossiper.GossipTask#run (there is also a 1-second initial delay when GossipTask is scheduled):
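The 5 + 3 x 1 = 8 second figure can be illustrated with a small simulation. This is only a sketch of a settle loop, assuming an initial 5-second wait followed by 1-second polls until the endpoint count is unchanged for 3 consecutive polls; the class name, method name, and parameters are hypothetical and this is not the actual Gossiper code. Time is accumulated rather than slept, so it runs instantly.

```java
import java.util.function.IntSupplier;

// Hypothetical, simplified model of a gossip "settle" loop (not the real Gossiper#waitToSettle).
final class GossipSettleSketch
{
    /**
     * Returns the simulated number of milliseconds spent waiting before the
     * endpoint count is considered settled.
     */
    static long settleWaitMillis(long minWaitMs, long pollIntervalMs, int successesRequired, IntSupplier endpointCount)
    {
        long waited = minWaitMs;                    // unconditional initial wait
        int lastCount = endpointCount.getAsInt();   // count observed after the initial wait
        int stablePolls = 0;
        while (stablePolls < successesRequired)
        {
            waited += pollIntervalMs;               // simulated sleep between polls
            int current = endpointCount.getAsInt();
            if (current == lastCount)
            {
                stablePolls++;                      // unchanged count: one more successful poll
            }
            else
            {
                stablePolls = 0;                    // count changed: start over
                lastCount = current;
            }
        }
        return waited;
    }

    public static void main(String[] args)
    {
        // A node that never exchanges gossip still sees a constant count (only itself),
        // so the loop "settles" after the minimum time.
        System.out.println(settleWaitMillis(5000, 1000, 3, () -> 1)); // prints 8000
    }
}
```

The point of the sketch is that a constant count looks the same whether gossip has converged or has not run at all, so settling alone does not guarantee that remote endpoint state is present.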

          private class GossipTask implements Runnable
          {
              public void run()
              {
                  try
                  {
                      //wait on messaging service to start listening
                      MessagingService.instance().waitUntilListening();
                      Thread.sleep(5000); // <=== added for reproduction: delays the first gossip round

                      taskLock.lock();
      

      With this change the NPE reproduces much more frequently.
      So it looks like the test can fail if, for some reason, GossipTask has not had a chance to run before EndpointState.getApplicationState is invoked as part of the test logic.
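One way a test helper could tolerate this race is to poll until the endpoint state exists instead of dereferencing it immediately. The following is a minimal sketch under stated assumptions: the map stands in for gossip state, a scheduled task stands in for the delayed first gossip round, and `awaitNonNull` and all other names are hypothetical, not part of Cassandra or of GossipHelper.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: wait for a value to appear instead of assuming it is already there.
final class AwaitEndpointStateSketch
{
    /** Polls the source until it returns non-null, or fails after timeoutMs. */
    static <T> T awaitNonNull(Supplier<T> source, long timeoutMs, long pollMs)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true)
        {
            T value = source.get();
            if (value != null)
                return value;
            if (System.nanoTime() - deadline >= 0)
                throw new IllegalStateException("value still null after " + timeoutMs + " ms");
            try
            {
                Thread.sleep(pollMs);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args)
    {
        // Stand-in for gossip state: endpoint address -> some state value.
        Map<String, String> endpointStates = new ConcurrentHashMap<>();
        ScheduledExecutorService gossip = Executors.newSingleThreadScheduledExecutor();
        try
        {
            // Stand-in for a delayed first gossip round.
            gossip.schedule(() -> { endpointStates.put("127.0.0.2", "schema-v1"); }, 200, TimeUnit.MILLISECONDS);

            // An immediate read may still see null, which is the situation behind the NPE.
            System.out.println("immediate read: " + endpointStates.get("127.0.0.2"));

            // Polling read: returns once the state has been published.
            System.out.println("after waiting: " + awaitNonNull(() -> endpointStates.get("127.0.0.2"), 5000, 10));
        }
        finally
        {
            gossip.shutdownNow();
        }
    }
}
```

Whether the right fix is to wait in the test helper, as sketched here, or to change the startup ordering is a design decision for the actual patch; the sketch only shows the waiting variant.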

      Note: in 5.1 the test is different and has no pullSchemaFrom logic at all.
      Discussion of this issue started in CASSANDRA-19651.

          People

            Assignee: Unassigned
            Reporter: Dmitry Konstantinov (dnk)