GEODE-6823: Hang in ElderInitProcessor.init()

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.0, 1.9.0
    • Fix Version/s: 1.10.0
    • Component/s: None

    Description

      A locator and a server were starting up at the same time, and the locator became stuck trying to initialize a distributed lock service. Extra logging showed that the server received an ElderInitMessage that it decided to ignore because it thought it was shutting down.
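
      The shape of the hang shows up in the stack trace below: ElderInitProcessor.init() blocks in ReplyProcessor21.waitForRepliesUninterruptibly() on a StoppableCountDownLatch, warning every 15 seconds but never giving up. Here is a minimal, self-contained sketch of that wait pattern, assuming one latch count per expected reply (class and method names here are illustrative, not the actual Geode source):

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      class ElderInitWaitSketch {
        private final CountDownLatch replies;

        ElderInitWaitSketch(int recipientCount) {
          // One outstanding reply expected from each recipient of ElderInitMessage.
          replies = new CountDownLatch(recipientCount);
        }

        void replyReceived() {
          replies.countDown();
        }

        // Mirrors the uninterruptible wait in the stack trace: it warns every
        // 15 seconds but never gives up, so a single dropped reply hangs it.
        void waitForReplies() {
          boolean interrupted = false;
          try {
            while (true) {
              try {
                if (replies.await(15, TimeUnit.SECONDS)) {
                  return;
                }
                System.err.println("15 seconds have elapsed while waiting for replies");
              } catch (InterruptedException e) {
                interrupted = true; // uninterruptible: remember it and keep waiting
              }
            }
          } finally {
            if (interrupted) {
              Thread.currentThread().interrupt();
            }
          }
        }
      }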

       

      gemfire2_2430/system.log: [info 2019/05/29 11:00:34.230 PDT <vm_2_thr_40_client2_host1_2430> tid=0x24] Initial (distribution manager) view, View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|2] members: [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead}, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]
      
      gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.463 PDT <unicast receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] Received message 'ElderInitMessage (processorId='1)' from <rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000>
      
      
      gemfire2_2430/system.log: [debug 2019/05/29 11:00:34.574 PDT <Pooled Message Processor 2> tid=0x4d] Waiting for Elder to change. Expecting Elder to be rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, is rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000.
      
      gemfire2_2430/system.log: [info 2019/05/29 11:00:34.575 PDT <Pooled Message Processor 2> tid=0x4d] ElderInitMessage (processorId='1): disregarding request from departed member.
      
      gemfire2_2430/system.log: [info 2019/05/29 11:00:35.238 PDT <unicast receiver,rs-GEM-2316-0906a2i3large-hydra-client-11-10705> tid=0x46] received new view: View[rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000|3] members: [rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001{lead}, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004]
      
      
      locator_ds_2354/system.log: [warn 2019/05/29 11:00:50.430 PDT <Pooled Message Processor 2> tid=0x38] 15 seconds have elapsed while waiting for replies: <ElderInitProcessor 1 waiting for 1 replies from [rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003]> on rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000 whose current membership list is: [[rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2437:2437)<ec><v3>:41004, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire2_host1_2430:2430)<ec><v2>:41003, rs-GEM-2316-0906a2i3large-hydra-client-11(2354:locator)<ec><v0>:41000, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2420:2420)<ec><v2>:41002, rs-GEM-2316-0906a2i3large-hydra-client-11(gemfire1_host1_2416:2416)<ec><v1>:41001]]
      
      [Stack #1 from bgexec15197_2354.log line 2]
      
      "Pooled Message Processor 2" #56 daemon prio=5 os_prio=0 tid=0x000000000194e800 nid=0xae3 waiting on condition [0x00007f5c94dce000]
         java.lang.Thread.State: TIMED_WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x0000000775ff4f08> (a java.util.concurrent.CountDownLatch$Sync)
      	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
      	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
      	at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:71)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:716)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:787)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:764)
      	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:850)
      	at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:69)
      	at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:53)
      	at org.apache.geode.distributed.internal.ClusterElderManager.lambda$new$0(ClusterElderManager.java:41)
      	at org.apache.geode.distributed.internal.ClusterElderManager$$Lambda$64/1182435120.get(Unknown Source)
      	at org.apache.geode.distributed.internal.ClusterElderManager.initializeElderState(ClusterElderManager.java:107)
      	at org.apache.geode.distributed.internal.ClusterElderManager.getElderState(ClusterElderManager.java:98)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:2885)
      	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.basicProcess(GrantorRequestProcessor.java:494)
      	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor$GrantorRequestMessage.process(GrantorRequestProcessor.java:489)
      	at org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:369)
      	at org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:435)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:959)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager.doProcessingThread(ClusterDistributionManager.java:825)
      	at org.apache.geode.distributed.internal.ClusterDistributionManager$$Lambda$77/482261794.invoke(Unknown Source)
      	at org.apache.geode.internal.logging.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:121)
      	at org.apache.geode.internal.logging.LoggingThreadFactory$$Lambda$78/2015475584.run(Unknown Source)
      	at java.lang.Thread.run(Thread.java:748)
      
      
      So 2430 did not respond to the ElderInitMessage. That would happen if 2430 was shutting down, if 2354 was no longer a member, or if 2430 had become the elder itself.
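
      A hypothetical, self-contained sketch of the reply-side guard implied by that analysis; the real logic lives in ElderInitProcessor and ClusterElderManager and differs in detail:

      class ElderInitReplySketch {
        interface ElderManager {
          // Returns false if this member is shutting down, the requester has
          // departed the view, or this member has itself become the elder.
          boolean waitForElder(String expectedElder) throws InterruptedException;
        }

        private final ElderManager elderManager;

        ElderInitReplySketch(ElderManager elderManager) {
          this.elderManager = elderManager;
        }

        void process(String sender) throws InterruptedException {
          if (!elderManager.waitForElder(sender)) {
            // No reply is sent, so the sender's ElderInitProcessor.init()
            // keeps waiting on its latch forever.
            System.out.println("ElderInitMessage: disregarding request from departed member.");
            return;
          }
          sendReply(sender);
        }

        private void sendReply(String recipient) {
          System.out.println("replying to " + recipient);
        }
      }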
      

      I added more logging to ClusterElderManager.waitForElder(), which is returning false and causing the "disregarding" message to be logged. What I see is that the server doing this thinks it's shutting down, but it most definitely is not:

      [debug 2019/05/29 15:58:48.488 PDT <unicast receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] Received message 'ElderInitMessage (processorId=1)' from <rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000>
      
      [debug 2019/05/29 15:58:48.510 PDT <Pooled Message Processor 2> tid=0x4d] Wait for elder terminated because we're shutting down
      
      [info 2019/05/29 15:58:48.511 PDT <Pooled Message Processor 2> tid=0x4d] ElderInitMessage (processorId=1): disregarding request from departed member.
      
      [debug 2019/05/29 15:58:48.599 PDT <unicast receiver,rs-GEM-2316-1539a0i3large-hydra-client-4-39597> tid=0x46] processing InstallViewMessage(type=PREPARE; Current ViewID=3; Previous View ID=0; View[rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000|3] members: [rs-GEM-2316-1539a0i3large-hydra-client-4(15224:locator)<ec><v0>:41000, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15283:15283)<ec><v1>:41001{lead}, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire1_host1_15290:15290)<ec><v2>:41002, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15295:15295)<ec><v3>:41004, rs-GEM-2316-1539a0i3large-hydra-client-4(gemfire2_host1_15304:15304)<ec><v3>:41003]; cred=null)
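
      The "Wait for elder terminated because we're shutting down" line suggests waitForElder() exits early as soon as its shutdown check fires. A simplified, self-contained sketch of that early exit, where the closeInProgress supplier stands in for the flawed check identified in the comments below (names are illustrative):

      import java.util.function.BooleanSupplier;

      class WaitForElderSketch {
        private final Object elderLock = new Object();
        private String currentElder; // guarded by elderLock
        private final BooleanSupplier closeInProgress;

        WaitForElderSketch(BooleanSupplier closeInProgress) {
          this.closeInProgress = closeInProgress;
        }

        void elderChanged(String newElder) {
          synchronized (elderLock) {
            currentElder = newElder;
            elderLock.notifyAll();
          }
        }

        boolean waitForElder(String expectedElder) throws InterruptedException {
          synchronized (elderLock) {
            while (!expectedElder.equals(currentElder)) {
              if (closeInProgress.getAsBoolean()) {
                // In the race described below, this fires during startup even
                // though no shutdown has begun, and the caller then logs the
                // "disregarding" message.
                return false;
              }
              elderLock.wait(1000);
            }
            return true;
          }
        }
      }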


          Activity

            bschuchardt Bruce J Schuchardt added a comment:

            This seems to have been introduced during refactoring of Elder management, so it's present in 1.8 and 1.9.

            A check of the ClusterDistributionManager.isCloseInProgress variable was changed to a call to the isCloseInProgress() method, which invokes InternalDistributedSystem.isDisconnecting(). That method is flawed in that it returns true before the InternalDistributedSystem's "dm" variable has been initialized, which is the case during startup.
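
            A minimal, self-contained illustration of that startup race, using simplified class shapes (hypothetical; the real fields live in InternalDistributedSystem and ClusterDistributionManager):

            class DisconnectCheckSketch {
              static class DistributionManager {
                volatile boolean isCloseInProgress; // the field the elder path should read
              }

              static class DistributedSystem {
                volatile DistributionManager dm; // assigned late in startup

                boolean isDisconnecting() {
                  // The flaw: while dm is still null, i.e. during startup, this
                  // reports true even though no shutdown has begun.
                  return dm == null || dm.isCloseInProgress;
                }
              }

              public static void main(String[] args) {
                DistributedSystem system = new DistributedSystem();
                System.out.println(system.isDisconnecting()); // true while starting up
                system.dm = new DistributionManager();
                System.out.println(system.isDisconnecting()); // false once initialized
              }
            }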

            jira-bot ASF subversion and git services added a comment:

            Commit 49997f3ea91109a8f8e17404f2fac5e7af2c19f3 in geode's branch refs/heads/feature/GEODE-6823 from Bruce Schuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=49997f3 ]

            GEODE-6823 Hang in ElderInitProcessor.init()

            This corrects elder init processing to use the isCloseInProgress variable to
            check for shutdown. A coding error during refactoring caused it to call the
            isCloseInProgress() method, which did more than just return the value of
            the isCloseInProgress variable and was incorrectly reporting a close in progress
            during startup operations.

            I've renamed the old isCloseInProgress() method to avoid similar coding
            errors in the future and added a new implementation that merely returns
            the value of the field, as you'd expect it to do.

            While writing tests I found that the ClusterElderManagerTest was leaving
            blocked threads behind because the waitForElder() method in
            ClusterElderManager was not interruptible. I've changed that method to
            be interruptible. We don't interrupt message-processing threads, so this
            should be a safe change.
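
            A hedged sketch of the accessor change the commit describes; the rename target isShutdownStarted() is illustrative, so check the actual commit for the real name:

            import java.util.function.BooleanSupplier;

            class CloseCheckFixSketch {
              private volatile boolean isCloseInProgress;
              private final BooleanSupplier systemIsDisconnecting;

              CloseCheckFixSketch(BooleanSupplier systemIsDisconnecting) {
                // Stands in for InternalDistributedSystem.isDisconnecting().
                this.systemIsDisconnecting = systemIsDisconnecting;
              }

              // New implementation: merely returns the field, as callers expect.
              public boolean isCloseInProgress() {
                return isCloseInProgress;
              }

              // The old, broader check under a new name (hypothetical), so future
              // refactorings can't mistake it for the plain field accessor again.
              public boolean isShutdownStarted() {
                return isCloseInProgress || systemIsDisconnecting.getAsBoolean();
              }

              public void startClose() {
                isCloseInProgress = true;
              }
            }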

            jira-bot ASF subversion and git services added a comment:

            Commit ef18e4d6c93455518023ca71ed481a70f1771fde in geode's branch refs/heads/develop from Bruce Schuchardt
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=ef18e4d ]

            GEODE-6823 Hang in ElderInitProcessor.init()

            This corrects elder init processing to use the isCloseInProgress variable to
            check for shutdown. A coding error during refactoring caused it to call the
            isCloseInProgress() method, which did more than just return the value of
            the isCloseInProgress variable and was incorrectly reporting a close in progress
            during startup operations.

            I've renamed the old isCloseInProgress() method to avoid similar coding
            errors in the future and added a new implementation that merely returns
            the value of the field, as you'd expect it to do.

            While writing tests I found that the ClusterElderManagerTest was leaving
            blocked threads behind because the waitForElder() method in
            ClusterElderManager was not interruptible. I've changed that method to
            be interruptible. We don't interrupt message-processing threads, so this
            should be a safe change.

            People

              Assignee: bschuchardt Bruce J Schuchardt
              Reporter: bschuchardt Bruce J Schuchardt
              Votes: 0
              Watchers: 2


                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m