Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
1.6.0
-
None
Description
geode-wan distributed tests hung after 8 hours (usually takes 5-6 hours)
https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/207
The progress files show this DistributedTest run starting ~4 am ... but it also shows this test completing at 9:05 ... which is only 5 hours. Are we somehow stuck in cache close for 3 hours before the hang is declared while trying to shut down the DUnit VM.
dunit-hangs.txt
Started @ 2018-03-17 03:57:50.567 +0000 Ended @ 2018-03-17 03:59:28.268 +0000 Started @ 2018-03-17 04:00:05.872 +0000 Ended @ 2018-03-17 04:52:41.879 +0000 Started @ 2018-03-17 04:53:00.358 +0000 Ended @ 2018-03-17 04:55:07.622 +0000 Started @ 2018-03-17 04:55:33.550 +0000 Ended @ 2018-03-17 07:34:11.639 +0000 Started @ 2018-03-17 07:34:51.022 +0000 Ended @ 2018-03-17 08:08:34.849 +0000 Started @ 2018-03-17 08:09:33.853 +0000 Ended @ 2018-03-17 08:31:43.646 +0000 Started @ 2018-03-17 08:31:54.175 +0000 Ended @ 2018-03-17 08:32:02.523 +0000 Started @ 2018-03-17 08:32:13.946 +0000 Ended @ 2018-03-17 08:32:34.626 +0000 Started @ 2018-03-17 08:33:20.100 +0000 2018-03-17 08:56:15.914 +0000 org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember Ended @ 2018-03-17 09:05:29.197 +0000
Last distributedTest running:
Started @ 2018-03-17 08:33:20.100 +0000 2018-03-17 08:56:15.914 +0000 org.apache.geode.internal.cache.wan.wancommand.StatusGatewaySenderCommandDUnitTest testGatewaySenderStatus_OnMember Ended @ 2018-03-17 09:05:29.197 +0000
Stack dumps are taken at 5-10 second intervals and they all look the same ... so it looks like 433 is stuck in waiting during cache close => distributedLockServiceDestroys waiting for replies from 493 for ElderInit. 493 is in ClusterDistributionManager.elderChangeWait().
Dumping stack for process 433:
2018-03-17 10:55:59
"RMI TCP Connection(1)-172.17.0.3" #30 daemon prio=5 os_prio=0 tid=0x00007f9760001800 nid=0x1ec waiting on condition [0x00007f975bffb000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000e08dd870> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64) at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715) at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790) at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766) at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853) at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72) at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56) at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderStateWithTryLock(ClusterDistributionManager.java:3359) at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:3309) at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:238) at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:347) at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:327) at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:318) at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:872) at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1227) - locked <0x00000000e09a8768> (a org.apache.geode.distributed.internal.locks.DLockGrantor) at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:646) at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2358) at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2276) - locked <0x00000000e040b400> (a java.lang.Object) at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2214) at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84) at org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderImpl.stop(SerialGatewaySenderImpl.java:136) at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2169) - locked <0x00000000e03ddf60> (a java.lang.Class for org.apache.geode.internal.cache.GemFireCacheImpl) at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1961) at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1951) at org.apache.geode.test.junit.rules.ServerStarterRule.stopMember(ServerStarterRule.java:102) at org.apache.geode.test.junit.rules.MemberStarterRule.after(MemberStarterRule.java:83) at org.apache.geode.test.dunit.rules.ClusterStartupRule.stopElementInsideVM(ClusterStartupRule.java:326) at org.apache.geode.test.junit.rules.VMProvider.lambda$stopVM$fe0d42dc$1(VMProvider.java:35) at org.apache.geode.test.junit.rules.VMProvider$$Lambda$79/501183203.run(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498)
-
-
-
-
-
-
-
-
- Dumping stack for process 493:
2018-03-17 10:56:02
- Dumping stack for process 493:
-
-
-
-
-
-
-
"Pooled Message Processor 1" #179 daemon prio=10 os_prio=0 tid=0x00007f3640002800 nid=0x763 in Object.wait() [0x00007f35edd65000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.geode.distributed.internal.ClusterDistributionManager.elderChangeWait(ClusterDistributionManager.java:3487) at org.apache.geode.distributed.internal.ClusterDistributionManager.waitForElder(ClusterDistributionManager.java:3431) - locked <0x00000000e069bde0> (a java.lang.Object) at org.apache.geode.distributed.internal.locks.ElderInitProcessor$ElderInitMessage.process(ElderInitProcessor.java:178) at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:382) at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:448) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1118) at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109) at org.apache.geode.distributed.internal.ClusterDistributionManager$4$1.run(ClusterDistributionManager.java:789) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x00000000e07073a8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
Attachments
Issue Links
- duplicates
-
GEODE-4650 DLockService.clearGrantor can potentially hang
- Resolved