Uploaded image for project: 'Geode'
  1. Geode
  2. GEODE-4892

CI failure: geode-wan distributedTest hung in org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember Ended

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 1.6.0
    • None

    Description

      geode-wan distributed tests hung after 8 hours (usually takes 5-6 hours)

      https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/207

      The progress files show this DistributedTest run starting ~4 am ... but it also shows this test completing at 9:05 ... which is only 5 hours. Are we somehow stuck in cache close for 3 hours before the hang is declared while trying to shut down the DUnit VM.

      dunit-hangs.txt

      Started @ 2018-03-17 03:57:50.567 +0000
      Ended @ 2018-03-17 03:59:28.268 +0000
      
      Started @ 2018-03-17 04:00:05.872 +0000
      Ended @ 2018-03-17 04:52:41.879 +0000
      
      Started @ 2018-03-17 04:53:00.358 +0000
      Ended @ 2018-03-17 04:55:07.622 +0000
      
      Started @ 2018-03-17 04:55:33.550 +0000
      Ended @ 2018-03-17 07:34:11.639 +0000
      
      Started @ 2018-03-17 07:34:51.022 +0000
      Ended @ 2018-03-17 08:08:34.849 +0000
      
      Started @ 2018-03-17 08:09:33.853 +0000
      Ended @ 2018-03-17 08:31:43.646 +0000
      
      Started @ 2018-03-17 08:31:54.175 +0000
      Ended @ 2018-03-17 08:32:02.523 +0000
      
      Started @ 2018-03-17 08:32:13.946 +0000
      Ended @ 2018-03-17 08:32:34.626 +0000
      
      Started @ 2018-03-17 08:33:20.100 +0000
      2018-03-17 08:56:15.914 +0000  org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember
      Ended @ 2018-03-17 09:05:29.197 +0000
      

      Last distributedTest running:

      Started @ 2018-03-17 08:33:20.100 +0000
      2018-03-17 08:56:15.914 +0000  org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember
      Ended @ 2018-03-17 09:05:29.197 +0000
      

      Stack dumps are taken at 5-10 second intervals and they all look the same ... so it looks like 433 is stuck in waiting during cache close => distributedLockServiceDestroys waiting for replies from 493 for ElderInit. 493 is in ClusterDistributionManager.elderChangeWait().

      Dumping stack for process 433:
      2018-03-17 10:55:59

      "RMI TCP Connection(1)-172.17.0.3" #30 daemon prio=5 os_prio=0 tid=0x00007f9760001800 nid=0x1ec waiting on condition [0x00007f975bffb000]
         java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000000e08dd870> (a java.util.concurrent.CountDownLatch$Sync)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
              at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
              at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
              at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766)
              at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853)
              at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72)
              at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderStateWithTryLock(ClusterDistributionManager.java:3359)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:3309)
              at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:238)
              at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:347)
              at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:327)
              at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:318)
              at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:872)
              at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1227)
              - locked <0x00000000e09a8768> (a org.apache.geode.distributed.internal.locks.DLockGrantor)
              at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:646)
              at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2358)
              at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2276)
              - locked <0x00000000e040b400> (a java.lang.Object)
              at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2214)
              at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84)
              at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderImpl.stop(SerialGatewaySenderImpl.java:136)
              at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2169)
              - locked <0x00000000e03ddf60> (a java.lang.Class for org.apache.geode.internal.cache.GemFireCacheImpl)
              at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1961)
              at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1951)
              at org.apache.geode.test.junit.rules.ServerStarterRule.stopMember(ServerStarterRule.java:102)
              at org.apache.geode.test.junit.rules.MemberStarterRule.after(MemberStarterRule.java:83)
              at org.apache.geode.test.dunit.rules.ClusterStartupRule.stopElementInsideVM(ClusterStartupRule.java:326)
              at org.apache.geode.test.junit.rules.VMProvider.lambda$stopVM$fe0d42dc$1(VMProvider.java:35)
              at org.apache.geode.test.junit.rules.VMProvider$$Lambda$79/501183203.run(Unknown Source)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
      
                      • Dumping stack for process 493:
                        2018-03-17 10:56:02
      "Pooled Message Processor 1" #179 daemon prio=10 os_prio=0 tid=0x00007f3640002800 nid=0x763 in Object.wait() [0x00007f35edd65000]
         java.lang.Thread.State: WAITING (on object monitor)
              at java.lang.Object.wait(Native Method)
              at java.lang.Object.wait(Object.java:502)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.elderChangeWait(ClusterDistributionManager.java:3487)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.waitForElder(ClusterDistributionManager.java:3431)
              - locked <0x00000000e069bde0> (a java.lang.Object)
              at org.apache.geode.distributed.internal.locks.ElderInitProcessor$ElderInitMessage.process(ElderInitProcessor.java:178)
              at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382)
              at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1118)
              at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
              at org.apache.geode.distributed.internal.ClusterDistributionManager$4$1.run(ClusterDistributionManager.java:789)
              at java.lang.Thread.run(Thread.java:748)
      
         Locked ownable synchronizers:
              - <0x00000000e07073a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
      

      Attachments

        Issue Links

          Activity

            There are no comments yet on this issue.

            People

              Unassigned Unassigned
              ladyvader Lynn Hughes-Godfrey
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: