GEODE-6823: Hang in ElderInitProcessor.init()

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0, 1.9.0
    • Fix Version/s: 1.10.0
    • Component/s: None

    Description

      A locator and a server were starting up at the same time, and the locator became stuck trying to initialize a distributed lock service. Extra logging showed that the server received an ElderInitMessage that it decided to ignore because it thought it was shutting down.
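
      The shape of the hang shows up in the stack trace below: ElderInitProcessor.init() blocks in ReplyProcessor21.waitForRepliesUninterruptibly() on a StoppableCountDownLatch, warning every 15 seconds but never giving up. Here is a minimal, self-contained sketch of that wait pattern, assuming one latch count per expected reply (class and method names here are illustrative, not the actual Geode source):

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      class ElderInitWaitSketch {
        private final CountDownLatch replies;

        ElderInitWaitSketch(int recipientCount) {
          // One outstanding reply expected from each recipient of ElderInitMessage.
          replies = new CountDownLatch(recipientCount);
        }

        void replyReceived() {
          replies.countDown();
        }

        // Mirrors the uninterruptible wait in the stack trace: it warns every
        // 15 seconds but never gives up, so a single dropped reply hangs it.
        void waitForReplies() {
          boolean interrupted = false;
          try {
            while (true) {
              try {
                if (replies.await(15, TimeUnit.SECONDS)) {
                  return;
                }
                System.err.println("15 seconds have elapsed while waiting for replies");
              } catch (InterruptedException e) {
                interrupted = true; // uninterruptible: remember it and keep waiting
              }
            }
          } finally {
            if (interrupted) {
              Thread.currentThread().interrupt();
            }
          }
        }
      }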

       

      gemfire2_2430/system.log: [info 2019/05/29 11:00:34.230 PDT <vm_2_thr_40_client2_host1_2430> tid=0x24] Initial (distribution manager) view, View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|2] members: [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead}, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]
      
      gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.463 PDT <unicast receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] Received message 'ElderInitMessage (processorId='1)' from <rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000>
      
      
      gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.574 PDT <Pooled Message Processor 2> tid=0x4d] Waiting for Elder to change. Expecting Elder to be rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, is rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000.
      
      gemfire2_2430/system.log: [info 2019/05/29 11:00:34.575 PDT <Pooled Message Processor 2> tid=0x4d] ElderInitMessage (processorId='1): disregarding request from departed member.
      
      gemfire2_2430/system.log: [info 2019/05/29 11:00:35.238 PDT <unicast receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] received new view: View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|3] members: [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead}, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004]
      
      
      locator_ds_2354/system.log: [warn 2019/05/29 11:00:50.430 PDT <Pooled Message Processor 2> tid=0x38] 15 seconds have elapsed while waiting for replies: <ElderInitProcessor 1 waiting for 1 replies from [rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]> on rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000 whose current membership list is: [[rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003, rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001]]
      
      [Stack #1 from bgexec15197_2354.log line 2]
      
      "Pooled Message Processor 2" #56 daemon prio=5 os_prio=0 tid=0x000000000194e800 nid=0xae3 waiting on condition [0x00007f5c94dce000]
         java.lang.Thread.State: TIMED_WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x0000000775ff4f08> (a java.util.concurrent.CountDownLatch$Sync)
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
      	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
      	at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:71)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:716)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:787)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:764)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:850)
      	at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:69)
      	at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:53)
      	at org.apache.geode.distributed.internal.ClusterElderManager.lambda$new$0(ClusterElderManager.java:41)
      	at org.apache.geode.distributed.internal.ClusterElderManager$$Lambda$64/1182435120.get(Unknown Source)
      	at org.apache.geode.distributed.internal.ClusterElderManager.initializeElderState(ClusterElderManager.java:107)
      	at org.apache.geode.distributed.internal.ClusterElderManager.getElderState(ClusterElderManager.java:98)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:2885)
      	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.basicProcess(GrantorRequestProcessor.java:494)
      	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.process(GrantorRequestProcessor.java:489)
      	at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:369)
      	at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:435)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:959)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.doProcessingThread(ClusterDistributionManager.java:825)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$77/482261794.invoke(Unknown Source)
      	at org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
      	at org.apache.geode.internal.logging.LoggingThreadFactory$$Lambda$78/2015475584.run(Unknown Source)
      	at java.lang.Thread.run(Thread.java:748)
      
      
      So 2430 did not respond to the ElderInitMessage. That would happen if 2430 was shutting down, if 2354 was no longer a member, or if 2430 had become the elder itself.
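
      A hypothetical, self-contained sketch of the reply-side guard implied by that analysis; the real logic lives in ElderInitProcessor and ClusterElderManager and differs in detail:

      class ElderInitReplySketch {
        interface ElderManager {
          // Returns false if this member is shutting down, the requester has
          // departed the view, or this member has itself become the elder.
          boolean waitForElder(String expectedElder) throws InterruptedException;
        }

        private final ElderManager elderManager;

        ElderInitReplySketch(ElderManager elderManager) {
          this.elderManager = elderManager;
        }

        void process(String sender) throws InterruptedException {
          if (!elderManager.waitForElder(sender)) {
            // No reply is sent, so the sender's ElderInitProcessor.init()
            // keeps waiting on its latch forever.
            System.out.println("ElderInitMessage: disregarding request from departed member.");
            return;
          }
          sendReply(sender);
        }

        private void sendReply(String recipient) {
          System.out.println("replying to " + recipient);
        }
      }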
      

      I added more logging to ClusterElderManager.waitForElder(), which is returning false and causing the "disregarding" message to be logged. What I see is that the server doing this thinks it's shutting down, but it most definitely is not:

      [debug 2019/05/29 15:58:48.488 PDT <unicast receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] Received message 'ElderInitMessage (processorId=1)' from <rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000>
      
      [debug 2019/05/29 15:58:48.510 PDT <Pooled Message Processor 2> tid=0x4d] Wait for elder terminated because we're shutting down
      
      [info 2019/05/29 15:58:48.511 PDT <Pooled Message Processor 2> tid=0x4d] ElderInitMessage (processorId=1): disregarding request from departed member.
      
      [debug 2019/05/29 15:58:48.599 PDT <unicast receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] processing InstallViewMessage(type=PREPARE; Current ViewID=3; Previous View ID=0; View[rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000|3] members: [rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15283:15283)<ec><v1>:41001{lead}, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15290:15290)<ec><v2>:41002, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15295:15295)<ec><v3>:41004, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15304:15304)<ec><v3>:41003]; cred=null)
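
      The "Wait for elder terminated because we're shutting down" line suggests waitForElder() exits early as soon as its shutdown check fires. A simplified, self-contained sketch of that early exit, where the closeInProgress supplier stands in for the flawed check identified in the comments below (names are illustrative):

      import java.util.function.BooleanSupplier;

      class WaitForElderSketch {
        private final Object elderLock = new Object();
        private String currentElder; // guarded by elderLock
        private final BooleanSupplier closeInProgress;

        WaitForElderSketch(BooleanSupplier closeInProgress) {
          this.closeInProgress = closeInProgress;
        }

        void elderChanged(String newElder) {
          synchronized (elderLock) {
            currentElder = newElder;
            elderLock.notifyAll();
          }
        }

        boolean waitForElder(String expectedElder) throws InterruptedException {
          synchronized (elderLock) {
            while (!expectedElder.equals(currentElder)) {
              if (closeInProgress.getAsBoolean()) {
                // In the race described below, this fires during startup even
                // though no shutdown has begun, and the caller then logs the
                // "disregarding" message.
                return false;
              }
              elderLock.wait(1000);
            }
            return true;
          }
        }
      }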


          Activity

            bschuchardt Bruce J Schuchardt added a comment:

            This seems to have been introduced during refactoring of Elder management, so it's present in 1.8 and 1.9.

            A check of the ClusterDistributionManager.isCloseInProgress variable was changed to a call to the isCloseInProgress() method, which invokes InternalDistributedSystem.isDisconnecting(). That method is flawed in that it returns true before the InternalDistributedSystem's "dm" variable has been initialized, which is the case during startup.
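
            A minimal, self-contained illustration of that startup race, using simplified class shapes (hypothetical; the real fields live in InternalDistributedSystem and ClusterDistributionManager):

            class DisconnectCheckSketch {
              static class DistributionManager {
                volatile boolean isCloseInProgress; // the field the elder path should read
              }

              static class DistributedSystem {
                volatile DistributionManager dm; // assigned late in startup

                boolean isDisconnecting() {
                  // The flaw: while dm is still null, i.e. during startup, this
                  // reports true even though no shutdown has begun.
                  return dm == null || dm.isCloseInProgress;
                }
              }

              public static void main(String[] args) {
                DistributedSystem system = new DistributedSystem();
                System.out.println(system.isDisconnecting()); // true while starting up
                system.dm = new DistributionManager();
                System.out.println(system.isDisconnecting()); // false once initialized
              }
            }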

            jira-bot ASF subversion and git services added a comment:

            Commit 49997f3ea91109a8f8e17404f2fac5e7af2c19f3 in geode's branch refs/heads/feature/GEODE-6823 from Bruce Schuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=49997f3 ]

            GEODE-6823 Hang in ElderInitProcessor.init()

            This corrects elder init processing to use the isCloseInProgress variable to
            check for shutdown. A coding error during refactoring caused it to call the
            isCloseInProgress() method, which did more than just return the value of
            the isCloseInProgress variable and was incorrectly reporting a close in progress
            during startup operations.

            I've renamed the old isCloseInProgress() method to avoid similar coding
            errors in the future and added a new implementation that merely returns
            the value of the field, as you'd expect it to do.

            While writing tests I found that the ClusterElderManagerTest was leaving
            blocked threads behind because the waitForElder() method in
            ClusterElderManager was not interruptible. I've changed that method to
            be interruptible. We don't interrupt message-processing threads, so this
            should be a safe change.
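
            A hedged sketch of the accessor change the commit describes; the rename target isShutdownStarted() is illustrative, so check the actual commit for the real name:

            import java.util.function.BooleanSupplier;

            class CloseCheckFixSketch {
              private volatile boolean isCloseInProgress;
              private final BooleanSupplier systemIsDisconnecting;

              CloseCheckFixSketch(BooleanSupplier systemIsDisconnecting) {
                // Stands in for InternalDistributedSystem.isDisconnecting().
                this.systemIsDisconnecting = systemIsDisconnecting;
              }

              // New implementation: merely returns the field, as callers expect.
              public boolean isCloseInProgress() {
                return isCloseInProgress;
              }

              // The old, broader check under a new name (hypothetical), so future
              // refactorings can't mistake it for the plain field accessor again.
              public boolean isShutdownStarted() {
                return isCloseInProgress || systemIsDisconnecting.getAsBoolean();
              }

              public void startClose() {
                isCloseInProgress = true;
              }
            }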

            jira-bot ASF subversion and git services added a comment:

            Commit ef18e4d6c93455518023ca71ed481a70f1771fde in geode's branch refs/heads/develop from Bruce Schuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=ef18e4d ]

            GEODE-6823 Hang in ElderInitProcessor.init()

            This corrects elder init processing to use the isCloseInProgress variable to
            check for shutdown. A coding error during refactoring caused it to call the
            isCloseInProgress() method, which did more than just return the value of
            the isCloseInProgress variable and was incorrectly reporting a close in progress
            during startup operations.

            I've renamed the old isCloseInProgress() method to avoid similar coding
            errors in the future and added a new implementation that merely returns
            the value of the field, as you'd expect it to do.

            While writing tests I found that the ClusterElderManagerTest was leaving
            blocked threads behind because the waitForElder() method in
            ClusterElderManager was not interruptible. I've changed that method to
            be interruptible. We don't interrupt message-processing threads, so this
            should be a safe change.

            People

              Assignee: bschuchardt Bruce J Schuchardt
              Reporter: bschuchardt Bruce J Schuchardt
              Votes: 0
              Watchers: 2


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m