Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.8.0, 1.9.0
- None
Description
A locator and a server were spinning up at the same time and the locator became stuck trying to initialize a distributed lock service. Extra logging showed that the server received an ElderInitMessage that it decided to ignore because it thought it was shutting down.
gemfire2_2430/system.log: [info 2019/05/29 11:00:34.230 PDT <vm_2_thr_40_client2_host1_2430> tid=0x24] Initial (distribution manager) view, View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|2] members: [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead}, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]

gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.463 PDT <unicast receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] Received message 'ElderInitMessage (processorId=1)' from <rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000>

gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.574 PDT <Pooled Message Processor 2> tid=0x4d] Waiting for Elder to change. Expecting Elder to be rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, is rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000.

gemfire2_2430/system.log: [info 2019/05/29 11:00:34.575 PDT <Pooled Message Processor 2> tid=0x4d] ElderInitMessage (processorId=1): disregarding request from departed member.

gemfire2_2430/system.log: [info 2019/05/29 11:00:35.238 PDT <unicast receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] received new view: View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|3] members: [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead}, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004]

locator_ds_2354/system.log: [warn 2019/05/29 11:00:50.430 PDT <Pooled Message Processor 2> tid=0x38] 15 seconds have elapsed while waiting for replies: <ElderInitProcessor 1 waiting for 1 replies from [rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]> on rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000 whose current membership list is: [[rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003, rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001]]

[Stack #1 from bgexec15197_2354.log line 2]
"Pooled Message Processor 2" #56 daemon prio=5 os_prio=0 tid=0x000000000194e800 nid=0xae3 waiting on condition [0x00007f5c94dce000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0000000775ff4f08> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:71)
    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:716)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:787)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:764)
    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:850)
    at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:69)
    at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:53)
    at org.apache.geode.distributed.internal.ClusterElderManager.lambda$new$0(ClusterElderManager.java:41)
    at org.apache.geode.distributed.internal.ClusterElderManager$$Lambda$64/1182435120.get(Unknown Source)
    at org.apache.geode.distributed.internal.ClusterElderManager.initializeElderState(ClusterElderManager.java:107)
    at org.apache.geode.distributed.internal.ClusterElderManager.getElderState(ClusterElderManager.java:98)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:2885)
    at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.basicProcess(GrantorRequestProcessor.java:494)
    at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.process(GrantorRequestProcessor.java:489)
    at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:369)
    at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:435)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:959)
    at org.apache.geode.distributed.internal.ClusterDistributionManager.doProcessingThread(ClusterDistributionManager.java:825)
    at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$77/482261794.invoke(Unknown Source)
    at org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
    at org.apache.geode.internal.logging.LoggingThreadFactory$$Lambda$78/2015475584.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)

So 2430 did not respond to the ElderInitMessage. That would happen if 2430 was shutting down, if 2354 was no longer a member, or if 2430 became the elder.
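To make the failure mode concrete, here is a minimal sketch of how a member that declines to answer leaves the sender blocked. This is not the actual Geode source: only ElderInitMessage, waitForElder(), and the "disregarding" log text come from the logs above; all other types and method shapes are assumptions for illustration.

// Hypothetical, simplified sketch -- not Geode source.
public class ElderInitReplySketch {

  interface Member {}

  interface ElderInitMessage {
    Member getSender();
  }

  interface ClusterElderManager {
    // Returns false when this member thinks it is shutting down, when the sender
    // is no longer a member, or when this member itself became the elder.
    boolean waitForElder(Member expectedElder) throws InterruptedException;
  }

  interface ReplySender {
    void reply(ElderInitMessage msg); // assumed: sends the reply the sender waits for
  }

  static void process(ElderInitMessage msg, ClusterElderManager elderManager,
      ReplySender replySender) throws InterruptedException {
    if (!elderManager.waitForElder(msg.getSender())) {
      // No reply is sent on this path, so the sender's ElderInitProcessor stays
      // parked in waitForRepliesUninterruptibly(), exactly as in Stack #1 above.
      System.out.println(msg + ": disregarding request from departed member.");
      return;
    }
    replySender.reply(msg);
  }
}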
I added more logging to ClusterElderManager.waitForElder(), which is returning false and causing the "disregarding" message to be logged. What I see is that the server doing this thinks it's shutting down, but it most definitely is not:
[debug 2019/05/29 15:58:48.488 PDT <unicast receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] Received message 'ElderInitMessage (processorId=1)' from <rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000>

[debug 2019/05/29 15:58:48.510 PDT <Pooled Message Processor 2> tid=0x4d] Wait for elder terminated because we're shutting down

[info 2019/05/29 15:58:48.511 PDT <Pooled Message Processor 2> tid=0x4d] ElderInitMessage (processorId=1): disregarding request from departed member.

[debug 2019/05/29 15:58:48.599 PDT <unicast receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] processing InstallViewMessage(type=PREPARE; Current ViewID=3; Previous View ID=0; View[rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000|3] members: [rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15283:15283)<ec><v1>:41001{lead}, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15290:15290)<ec><v2>:41002, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15295:15295)<ec><v3>:41004, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15304:15304)<ec><v3>:41003]; cred=null)
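For reference, a rough, self-contained sketch of the shape of ClusterElderManager.waitForElder() implied by the debug output above. Only the method name, the call to isCloseInProgress(), and the two log messages come from this report; the rest (getElderId(), waitForViewChange(), the loop structure) is assumed.

// Hypothetical sketch, not Geode source: shows where the shutdown check sits
// relative to the two debug messages seen in the logs above.
public class WaitForElderSketch {

  interface Member {}

  interface DistributionManager {
    boolean isCloseInProgress();            // the check that misfires during startup
    Member getElderId();                    // assumed: current elder, may change with views
    void waitForViewChange() throws InterruptedException; // assumed: blocks until the view changes
  }

  static boolean waitForElder(DistributionManager dm, Member desiredElder)
      throws InterruptedException {
    while (true) {
      Member current = dm.getElderId();
      if (desiredElder.equals(current)) {
        return true;
      }
      if (dm.isCloseInProgress()) {
        // The branch hit in the log above: during startup isCloseInProgress()
        // reports true (see the root-cause comment further down), so we bail out
        // and the caller logs "disregarding request from departed member"
        // without sending a reply.
        System.out.println("Wait for elder terminated because we're shutting down");
        return false;
      }
      System.out.println("Waiting for Elder to change. Expecting Elder to be "
          + desiredElder + ", is " + current + ".");
      dm.waitForViewChange();
    }
  }
}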
This seems to have been introduced during refactoring of Elder management, so it's present in 1.8 and 1.9.
A check for the state of the variable ClusterDistributionManager.isCloseInProgress was changed to use the method isCloseInProgress(), which invokes InternalDistributedSystem.isDisconnecting(). That method is flawed in that it will return true before the InternalDistributedSystem's "dm" variable has been initialized, which is the case during startup.
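Roughly what that amounts to, as a simplified, self-contained illustration: the class and member names mirror the comment above, but the method bodies are assumptions, not the actual Geode implementation.

// Simplified, hypothetical sketch of the flaw described above -- not Geode source.
public class IsCloseInProgressSketch {

  static class InternalDistributedSystem {
    volatile ClusterDistributionManager dm; // still null during early startup

    boolean isDisconnecting() {
      // The flaw: before "dm" is assigned, this reports true, so a member that is
      // still starting up looks as if it were shutting down.
      if (dm == null) {
        return true;
      }
      return false; // (the real method checks actual disconnect state here)
    }
  }

  static class ClusterDistributionManager {
    final InternalDistributedSystem system;
    volatile boolean isCloseInProgress; // the field the old code read directly

    ClusterDistributionManager(InternalDistributedSystem system) {
      this.system = system;
    }

    // The refactored check: delegates to isDisconnecting(), which is what makes
    // waitForElder() conclude "we're shutting down" during startup.
    boolean isCloseInProgress() {
      return isCloseInProgress || system.isDisconnecting();
    }
  }

  public static void main(String[] args) {
    InternalDistributedSystem ids = new InternalDistributedSystem();
    ClusterDistributionManager dm = new ClusterDistributionManager(ids);
    // The DM exists but ids.dm has not been wired up yet, as during startup:
    System.out.println("field  isCloseInProgress   = " + dm.isCloseInProgress);   // false
    System.out.println("method isCloseInProgress() = " + dm.isCloseInProgress()); // true
  }
}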