Geode / GEODE-2162

AsyncEventListenerDUnitTest testParallelAsyncEventQueueWithOneAccessor hangs on tearDown

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce

    Description

      "RMI TCP Connection(10)-10.118.32.97" #27 daemon prio=5 os_prio=0 tid=0x00007f4148001800 nid=0x753f waiting on condition [0x00007f41ad02d000]
      java.lang.Thread.State: TIMED_WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

        - parking to wait for <0x00000000e090e1b8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
        at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:716)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:793)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:769)
        at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:856)
        at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72)
        at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56)
        at org.apache.geode.distributed.internal.DistributionManager.getElderStateWithTryLock(DistributionManager.java:3631)
        at org.apache.geode.distributed.internal.DistributionManager.getElderState(DistributionManager.java:3581)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:240)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:349)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:329)
        at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:320)
        at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:850)
        at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1226)
        - locked <0x00000000e09ee050> (a org.apache.geode.distributed.internal.locks.DLockGrantor)
        at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:640)
        at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2496)
        at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2415)
        - locked <0x00000000e0554e08> (a java.lang.Object)
        at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2316)
        at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84)
        at org.apache.geode.internal.cache.GemFireCacheImpl.destroyGatewaySenderLockService(GemFireCacheImpl.java:2053)
        at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2196)
        - locked <0x00000000e0347638> (a java.lang.Class for org.apache.geode.internal.cache.GemFireCacheImpl)
        at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1973)
        at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1969)
        at org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.closeCache(JUnit4DistributedTestCase.java:602)
        at org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.tearDownVM(JUnit4DistributedTestCase.java:563)
        at org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase.lambda$cleanupAllVms$bb17a952$1(JUnit4DistributedTestCase.java:554)
        at org.apache.geode.test.dunit.internal.JUnit4DistributedTestCase$$Lambda$23/2043837261.run(Unknown Source)
        at org.apache.geode.test.dunit.NamedRunnable.run(NamedRunnable.java:31)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at hydra.MethExecutor.executeObject(MethExecutor.java:245)
        at org.apache.geode.test.dunit.standalone.RemoteDUnitVM.executeMethodOnObject(RemoteDUnitVM.java:73)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$95(TCPTransport.java:683)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$1/702822778.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
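
      The key frames are ReplyProcessor21.waitForRepliesUninterruptibly() parked on a StoppableCountDownLatch: the cache-close thread is waiting for ElderInitProcessor replies. Judging from the TIMED_WAITING state here and the "15 seconds have elapsed while waiting for replies" warning in the logs below, the wait is done in bounded slices so the thread can keep re-warning rather than ever giving up on its own. A minimal sketch of that bounded-wait-and-warn pattern, for illustration only (this is not Geode's ReplyProcessor21; the class and method names are made up):

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      // Sketch of a reply-wait loop that parks in timed slices so it can emit a
      // periodic "still waiting for replies" warning, yet never times out overall.
      class ReplyWaiter {
        private final CountDownLatch repliesReceived; // counted down as members reply

        ReplyWaiter(int expectedReplies) {
          this.repliesReceived = new CountDownLatch(expectedReplies);
        }

        void memberReplied() {
          repliesReceived.countDown();
        }

        /** Blocks until all replies arrive, warning every warnIntervalMillis. */
        void waitForRepliesUninterruptibly(long warnIntervalMillis) {
          boolean interrupted = false;
          try {
            while (true) {
              try {
                // TIMED_WAITING (parking) in the dump comes from a timed await like this.
                if (repliesReceived.await(warnIntervalMillis, TimeUnit.MILLISECONDS)) {
                  return; // all replies received
                }
                System.err.println(warnIntervalMillis / 1000
                    + " seconds have elapsed while waiting for replies");
              } catch (InterruptedException e) {
                interrupted = true; // remember the interrupt but keep waiting
              }
            }
          } finally {
            if (interrupted) {
              Thread.currentThread().interrupt(); // restore interrupt status
            }
          }
        }
      }

      If one member never sends its reply, a loop like this parks and re-warns indefinitely, which is the hang tearDown ran into here.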


        Activity

          Hitesh Khamesra added a comment:

          Here is the stack of the member that is not responding:

          "Pooled Message Processor 1" #87 daemon prio=10 os_prio=0 tid=0x00007fc188014800 nid=0x1a48 in Object.wait() [0x00007fc137ffe000]
          java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(Native Method)

            - waiting on <0x00000000e033afe8> (a java.lang.Object)
            at java.lang.Object.wait(Object.java:502)
            at org.apache.geode.distributed.internal.DistributionManager.elderChangeWait(DistributionManager.java:3751)
            at org.apache.geode.distributed.internal.DistributionManager.waitForElder(DistributionManager.java:3695)
            - locked <0x00000000e033afe8> (a java.lang.Object)
            at org.apache.geode.distributed.internal.locks.ElderInitProcessor$ElderInitMessage.process(ElderInitProcessor.java:177)
            at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:376)
            at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:442)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:621)
            at org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:878)
            at java.lang.Thread.run(Thread.java:745)
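
          This thread is blocked in DistributionManager.elderChangeWait() while processing the ElderInitMessage, so its reply cannot be sent until the elder change becomes visible locally. A minimal sketch of that wait-for-elder pattern, assuming a monitor that is notified when a new membership view installs the elder (illustrative names only, not Geode's DistributionManager):

          // Sketch of a "wait until the elder change is locally visible" loop.
          class ElderTracker {
            private final Object elderMonitor = new Object();
            private String currentElderId; // updated when a new view is installed

            /** Called by the message-processing thread before it can reply to ElderInit. */
            void waitForElder(String expectedElderId) throws InterruptedException {
              synchronized (elderMonitor) {
                // WAITING (on object monitor) in the dump corresponds to this loop:
                // the thread takes the monitor, then releases it inside wait().
                while (!expectedElderId.equals(currentElderId)) {
                  elderMonitor.wait(); // parked until elderChanged(...) notifies
                }
              }
            }

            /** Called by the view-installation path once the elder change is applied locally. */
            void elderChanged(String newElderId) {
              synchronized (elderMonitor) {
                currentElderId = newElderId;
                elderMonitor.notifyAll(); // wakes any thread blocked in waitForElder()
              }
            }
          }

          If the notifying side never runs on this member, the reply the closing VM is waiting for above is never sent, which would produce exactly this pair of stacks.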

          Hitesh Khamesra added a comment:

          Here are the relevant logs:

          [info 2016/11/30 20:09:30.595 PST <Pooled High Priority Message Processor 2> tid=0x4a] received leave request from 10.118.32.97(29824)<ec><v0>:32770 for 10.118.32.97(29824)<ec><v0>:32770

          [info 2016/11/30 20:09:30.595 PST <Pooled High Priority Message Processor 2> tid=0xcde] Member at 10.118.32.97(29824)<ec><v0>:32770 gracefully left the distributed cache: shutdown message received

          [info 2016/11/30 20:09:30.596 PST <Geode Membership View Creator> tid=0xce6] preparing new view View[10.118.32.97(29828)<v1>:32771|13] members: [10.118.32.97(29828)<v1>:32771{lead}, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]  shutdown: [10.118.32.97(29824)<ec><v0>:32770]
          failure detection ports: 56090 51475 32882 48010

          [info 2016/11/30 20:09:30.599 PST <Geode Membership View Creator> tid=0xce6] finished waiting for responses to view preparation

          [info 2016/11/30 20:09:30.599 PST <Geode Membership View Creator> tid=0xce6] received new view: View[10.118.32.97(29828)<v1>:32771|13] members: [10.118.32.97(29828)<v1>:32771{lead}, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]  shutdown: [10.118.32.97(29824)<ec><v0>:32770]
          old view is: View[10.118.32.97(29824)<ec><v0>:32770|4] members: [10.118.32.97(29824)<ec><v0>:32770{lead}, 10.118.32.97(29828)<v1>:32771, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]

          [info 2016/11/30 20:09:30.600 PST <Geode Membership View Creator> tid=0xce6] sending new view View[10.118.32.97(29828)<v1>:32771|13] members: [10.118.32.97(29828)<v1>:32771{lead}, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]  shutdown: [10.118.32.97(29824)<ec><v0>:32770]
          failure detection ports: 56090 51475 32882 48010

          [info 2016/11/30 20:09:30.601 PST <View Message Processor> tid=0xcd0] I, 10.118.32.97(29828)<v1>:32771, am the elder.

          [info 2016/11/30 20:09:30.601 PST <unicast receiver,poland-65368> tid=0x4a7] received new view: View[10.118.32.97(29828)<v1>:32771|13] members: [10.118.32.97(29828)<v1>:32771{lead}, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]  shutdown: [10.118.32.97(29824)<ec><v0>:32770]
          old view is: View[10.118.32.97(29824)<ec><v0>:32770|4] members: [10.118.32.97(29824)<ec><v0>:32770{lead}, 10.118.32.97(29828)<v1>:32771, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]

          [info 2016/11/30 20:09:30.601 PST <unicast receiver,poland-51092> tid=0x126] received new view: View[10.118.32.97(29828)<v1>:32771|13] members: [10.118.32.97(29828)<v1>:32771{lead}, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]  shutdown: [10.118.32.97(29824)<ec><v0>:32770]
          old view is: View[10.118.32.97(29824)<ec><v0>:32770|4] members: [10.118.32.97(29824)<ec><v0>:32770{lead}, 10.118.32.97(29828)<v1>:32771, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]

          [info 2016/11/30 20:09:30.601 PST <unicast receiver,poland-51998> tid=0x29] received new view: View[10.118.32.97(29828)<v1>:32771|13] members: [10.118.32.97(29828)<v1>:32771{lead}, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]  shutdown: [10.118.32.97(29824)<ec><v0>:32770]
          old view is: View[10.118.32.97(29824)<ec><v0>:32770|4] members: [10.118.32.97(29824)<ec><v0>:32770{lead}, 10.118.32.97(29828)<v1>:32771, 10.118.32.97(29833)<v2>:32772, 10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774]

          [info 2016/11/30 20:09:30.607 PST <Pooled High Priority Message Processor 2> tid=0x4a] Member at 10.118.32.97(29824)<ec><v0>:32770 gracefully left the distributed cache: shutdown message received

          [info 2016/11/30 20:09:30.694 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] Now closing distribution for 10.118.32.97(29824)<ec><v0>:32770

          [info 2016/11/30 20:09:30.694 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] Stopping membership services

          [info 2016/11/30 20:09:30.697 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] GMSHealthMonitor server socket is closed in stopServices().

          [info 2016/11/30 20:09:30.697 PST <Geode Failure Detection Server thread 0> tid=0xb46] GMSHealthMonitor server thread exiting

          [info 2016/11/30 20:09:30.697 PST <unicast receiver,poland-62793> tid=0xcba] received leave request from 10.118.32.97(:<unknown:0>)<ec><v0>:32770 for 10.118.32.97(29824)<ec><v0>:32770

          [info 2016/11/30 20:09:30.697 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] GMSHealthMonitor serverSocketExecutor is terminated

          [info 2016/11/30 20:09:30.697 PST <unicast receiver,poland-65368> tid=0x4a7] received leave request from 10.118.32.97(:<unknown:0>)<ec><v0>:32770 for 10.118.32.97(29824)<ec><v0>:32770

          [info 2016/11/30 20:09:30.698 PST <unicast receiver,poland-51092> tid=0x126] received leave request from 10.118.32.97(:<unknown:0>)<ec><v0>:32770 for 10.118.32.97(29824)<ec><v0>:32770

          [info 2016/11/30 20:09:30.697 PST <unicast receiver,poland-51998> tid=0x29] received leave request from 10.118.32.97(:<unknown:0>)<ec><v0>:32770 for 10.118.32.97(29824)<ec><v0>:32770

          [info 2016/11/30 20:09:30.703 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] DistributionManager stopped in 111ms.

          [info 2016/11/30 20:09:30.703 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] Marking DistributionManager 10.118.32.97(29824)<ec><v0>:32770 as closed.

          [info 2016/11/30 20:09:30.704 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] Stopping Distribution Locator on localhost/127.0.0.1[26789]

          [info 2016/11/30 20:09:30.705 PST <Distribution Locator on localhost/127.0.0.1[null]> tid=0xb36] locator shutting down

          [info 2016/11/30 20:09:30.707 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] Distribution Locator on localhost/127.0.0.1[0] is stopped

          [info 2016/11/30 20:09:30.707 PST <RMI TCP Connection(13)-10.118.32.97> tid=0x8b] Got result: null
          from org.apache.geode.test.dunit.NamedRunnable.run with 0 args on object: runnable(tearDownVM) (took 116 ms)

          [info 2016/11/30 20:09:30.709 PST <RMI TCP Connection(10)-10.118.32.97> tid=0x1b] Received method: org.apache.geode.test.dunit.NamedRunnable.run with 0 args on object: runnable(tearDownVM)

          [info 2016/11/30 20:09:30.713 PST <RMI TCP Connection(10)-10.118.32.97> tid=0x1b] GemFireCache[id = 2145939405; isClosing = true; isShutDownAll = false; created = Wed Nov 30 20:09:25 PST 2016; server = false; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.

          [info 2016/11/30 20:09:30.954 PST <BatchRemovalThread> tid=0xce1] The QueueRemovalThread is done.

          [info 2016/11/30 20:09:30.955 PST <RMI TCP Connection(10)-10.118.32.97> tid=0x1b] Stopped ParallelGatewaySender {id=AsyncEventQueue_ln,remoteDsId=-1,isRunning =false}

          [info 2016/11/30 20:09:30.959 PST <Pooled Message Processor 1> tid=0x57] Changing Elder from 10.118.32.97(29824)<ec><v0>:32770 to 10.118.32.97(29828)<v1>:32771.

          [warn 2016/11/30 20:09:45.957 PST <RMI TCP Connection(10)-10.118.32.97> tid=0x1b] 15 seconds have elapsed while waiting for replies: <ElderInitProcessor 12712 waiting for 1 replies from [10.118.32.97(6382)<v4>:32774]> on 10.118.32.97(29828)<v1>:32771 whose current membership list is: [[10.118.32.97(29837)<v3>:32773, 10.118.32.97(6382)<v4>:32774, 10.118.32.97(29828)<v1>:32771, 10.118.32.97(29833)<v2>:32772]]


          Hitesh Khamesra added a comment:

          It seems the member has not yet recognized that the elder has changed.


          Bruce J Schuchardt added a comment:

          We really needed to know what the state of 10.118.32.97(6382)<v4>:32774 was. It's odd that the output Hitesh put in his comment doesn't have the dunit VM of origin in the log messages. When I run this test I see [vm_0], [vm_1], [vm_2], etc. at the beginning of each log statement. Nothing is run in the controller VM, so there's nothing logged by it.

          I've run this test over 1000 times with various prods to make sure that the member who takes over from the locator becomes the elder and lock grantor, but I've been unable to reproduce this hang, or even a reply-wait timeout. I added hooks in the Elder initialization code to make sure that the same scenario was being run, but it always worked as designed (a sketch of that kind of hook follows below).

          I'm closing this ticket. If it happens again we need a more thorough collection of artifacts - at least the full log of the run, including output from all of the VMs.
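
          For context, the kind of hook described above can be as simple as a static callback that the production code invokes at the point of interest and that a test installs temporarily. A hedged sketch of that pattern (hypothetical names, not the actual hook that was added to Geode):

          import java.util.concurrent.atomic.AtomicInteger;

          // A static test hook: null in production, installed by a test to observe
          // (or deliberately stall) elder initialization.
          final class ElderInitTestHook {
            static volatile Runnable beforeElderInitReply;

            // Invoked from the (hypothetical) elder-init code path:
            static void onBeforeElderInitReply() {
              Runnable hook = beforeElderInitReply;
              if (hook != null) {
                hook.run();
              }
            }
          }

          class ElderInitHookExample {
            public static void main(String[] args) {
              AtomicInteger elderInitCount = new AtomicInteger();
              // Install a counting hook to prove the scenario under test really ran.
              ElderInitTestHook.beforeElderInitReply = elderInitCount::incrementAndGet;
              // ... run the cache-close / elder-handoff scenario here ...
              ElderInitTestHook.onBeforeElderInitReply(); // simulating the production call site
              System.out.println("elder init observed " + elderInitCount.get() + " time(s)");
              ElderInitTestHook.beforeElderInitReply = null; // always clear hooks in tearDown
            }
          }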


          People

            bschuchardt Bruce J Schuchardt
            hitesh.khamesra Hitesh Khamesra
            Votes: 0
            Watchers: 2
