DLockService.clearGrantor can potentially hang

Details

    Description

      There was a test run in the precheckin pipeline that hung with the following stack:

       

      "RMI TCP Connection(1)-172.17.0.3" #30 daemon prio=5 os_prio=0 tid=0x00007f4560001800 nid=0x191 waiting on condition [0x00007f45771c0000]
      java.lang.Thread.State: TIMED_WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0x00000000e082d298> (a java.util.concurrent.CountDownLatch$Sync)
      at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
      at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
      at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
      at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
      at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790)
      at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766)
      at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853)
      at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72)
      at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56)
      at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderStateWithTryLock(ClusterDistributionManager.java:3359)
      at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:3309)
      at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:238)
      at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:347)
      at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:327)
      at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:318)
      at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:872)
      at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1227)
      - locked <0x00000000e0837ff0> (a org.apache.geode.distributed.internal.locks.DLockGrantor)
      at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:646)
      at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2358)
      at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2276)
      - locked <0x00000000e05c7468> (a java.lang.Object)
      at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2214)
      at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84)
      at org.apache.geode.internal.cache.GemFireCacheImpl.destroyGatewaySenderLockService(GemFireCacheImpl.java:2043)
      at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2180)
      - locked <0x00000000e04653e0> (a java.lang.Class for org.apache.geode.internal.cache.GemFireCacheImpl)
      at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1960)
      at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1950)
      at org.apache.geode.test.junit.rules.ServerStarterRule.stopMember(ServerStarterRule.java:99)
      at org.apache.geode.test.junit.rules.MemberStarterRule.after(MemberStarterRule.java:81)
      at org.apache.geode.test.dunit.rules.ClusterStartupRule.stopElementInsideVM(ClusterStartupRule.java:412)
      at org.apache.geode.test.junit.rules.VMProvider.lambda$stopVM$fe0d42dc$1(VMProvider.java:35)
      at org.apache.geode.test.junit.rules.VMProvider$$Lambda$53/208982926.run(Unknown Source)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at hydra.MethExecutor.executeObject(MethExecutor.java:244)
      at org.apache.geode.test.dunit.standalone.RemoteDUnitVM.executeMethodOnObject(RemoteDUnitVM.java:70)
      at sun.reflect.GeneratedMethodAccessor52.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
      at sun.rmi.transport.Transport$1.run(Transport.java:200)
      at sun.rmi.transport.Transport$1.run(Transport.java:197)
      at java.security.AccessController.doPrivileged(Native Method)
      at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
      at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
      at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
      at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
      at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$7/1394836008.run(Unknown Source)
      at java.security.AccessController.doPrivileged(Native Method)
      at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      
      Locked ownable synchronizers:
      - <0x00000000e0332230> (a java.util.concurrent.ThreadPoolExecutor$Worker)
      - <0x00000000e08499b0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      - <0x00000000e08520f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      

      It looks like the cache is shutting down and we are unable to destroy the lock service for the gateway sender.
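
The top frames of the stack above (a timed CountDownLatch.await reached from waitForRepliesUninterruptibly) have the shape of a timed-wait retry loop. The following is a minimal sketch of that pattern, not Geode's ReplyProcessor21 code: the class name, method name, and the maxSlices bound are hypothetical and exist only so the example terminates. Without such a bound, a waiter whose expected reply never arrives would stay in this loop indefinitely, which would look like the TIMED_WAITING thread in the dump.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of Geode.
final class ReplyWaitSketch {

  /**
   * Waits for the latch in short timed slices, ignoring interrupts while
   * waiting. Returns true if the latch opened, false if maxSlices elapsed.
   */
  static boolean waitForReplies(CountDownLatch replies, long sliceMillis, int maxSlices) {
    boolean interrupted = false;
    try {
      for (int i = 0; i < maxSlices; i++) {
        try {
          if (replies.await(sliceMillis, TimeUnit.MILLISECONDS)) {
            return true; // all expected replies arrived
          }
        } catch (InterruptedException e) {
          interrupted = true; // remember the interrupt and keep waiting
        }
      }
      return false; // gave up: the reply never came
    } finally {
      if (interrupted) {
        Thread.currentThread().interrupt(); // restore the interrupt status
      }
    }
  }

  public static void main(String[] args) {
    CountDownLatch neverAnswered = new CountDownLatch(1);
    System.out.println(waitForReplies(neverAnswered, 10, 3)); // prints false

    CountDownLatch answered = new CountDownLatch(1);
    answered.countDown();
    System.out.println(waitForReplies(answered, 10, 3)); // prints true
  }
}
```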

       

      Attachments

        1. callstacks-2018-02-10-05-25-15.txt
          151 kB
          Jason Huynh
        2. callstacks-2018-02-10-05-25-23.txt
          150 kB
          Jason Huynh
        3. callstacks-2018-02-10-05-25-30.txt
          147 kB
          Jason Huynh

          Activity

            jasonhuynh Jason Huynh added a comment -

            Attached are the call stacks for the hung run.

            upthewaterspout Dan Smith added a comment -

            We are continuing to see this issue cause DistributedTest to hang on occasion. The most recent hang was in this build:

            https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/113

            Hung test: 2018-07-13 04:13:37.012 +0000 org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest testVerifyGatewaySenderProfile[from_v150]

            	at sun.misc.Unsafe.park(Native Method)
            	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
            	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
            	at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
            	at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:715)
            	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:790)
            	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:766)
            	at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:853)
            	at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:72)
            	at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:56)
            	at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderStateWithTryLock(ClusterDistributionManager.java:3359)
            	at org.apache.geode.distributed.internal.ClusterDistributionManager.getElderState(ClusterDistributionManager.java:3309)
            	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:238)
            	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:347)
            	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:327)
            	at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:318)
            	at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:872)
            	at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1227)
            	at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:646)
            	at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2358)
            	at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2276)
            	at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2214)
            	at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:84)
            	at org.apache.geode.internal.cache.GemFireCacheImpl.destroyGatewaySenderLockService(GemFireCacheImpl.java:2043)
            	at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2180)
            	at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1960)
            	at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1950)
            	at org.apache.geode.test.dunit.cache.internal.JUnit4CacheTestCase.closeCache(JUnit4CacheTestCase.java:323)
            	at org.apache.geode.test.dunit.cache.internal.JUnit4CacheTestCase.remoteTearDown(JUnit4CacheTestCase.java:378)
            
            
            
            jinmeiliao Jinmei Liao added a comment -

            This build also hung with this test:

            https://concourse.apachegeode-ci.info/teams/main/pipelines/develop/jobs/DistributedTest/builds/114

            Hung test: 2018-07-13 16:49:45.530 +0000 org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest testVerifyGatewaySenderProfile[from_v110]

            Commit 52bd3fc63970e2929fc8df76a7621d5147a6e393 in geode's branch refs/heads/develop from balesh2
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=52bd3fc ]

            GEODE-4650: Refactor Elder selection (#2393)

            GEODE-4650: Resolve race condition in selection of the elder

            • no longer cache the elder, re-compute the elder when needed
            • extract elder logic to a new class to make unit testing possible
            • adds tests for elder selection
            • adds tests of DLock Grantor failover
            • removes isAdam() - isAdam used to mean that the member was alone (that there were no non-surprise, non-admin members in the cluster) when it joined. This was only used in two places. The first, in the DLockService, protected against recovering dlocks when there isn't a cluster. This usage is replaced with a check for isLoner(). The other use of isAdam was in ElderInitProcessor and was redundant with an inner check if there were other members in the distributed system.
            • fix testFairness so that it can be run repeatedly in the same JVM

            Signed-off-by: Dan Smith <dsmith@pivotal.io>
            Signed-off-by: Galen O'Sullivan <gosullivan@pivotal.io>
            Signed-off-by: Ken Howe <khowe@pivotal.io>
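
The first bullet of the commit message above (stop caching the elder and re-compute it when needed) can be illustrated with a small sketch. This is not the code from the commit: the class ElderSketch, its method currentElder, the member names, and the eligibility rule (skip members whose name starts with "admin-") are all hypothetical, and the sketch assumes the elder is the first eligible member in a view ordered oldest first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

// Hypothetical illustration only; not the class introduced by the commit.
final class ElderSketch {
  private final List<String> view;            // live membership view, oldest first
  private final Predicate<String> eligible;   // which members may be the elder

  ElderSketch(List<String> view, Predicate<String> eligible) {
    this.view = view;
    this.eligible = eligible;
  }

  /** Derives the elder from the current view on every call; nothing is cached. */
  Optional<String> currentElder() {
    for (String member : view) {
      if (eligible.test(member)) {
        return Optional.of(member);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> view =
        new CopyOnWriteArrayList<>(Arrays.asList("admin-1", "server-1", "server-2"));
    ElderSketch sketch = new ElderSketch(view, m -> !m.startsWith("admin-"));

    System.out.println(sketch.currentElder().orElse("none")); // prints server-1
    view.remove("server-1"); // the old elder departs
    System.out.println(sketch.currentElder().orElse("none")); // prints server-2
  }
}
```

Because the answer is derived from the view each time, there is no cached value that can go stale when the previous elder leaves, which is the kind of window a cached elder leaves open.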

            Commit 52bd3fc63970e2929fc8df76a7621d5147a6e393 in geode's branch refs/heads/concourse-staging from balesh2
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=52bd3fc ]

            Commit 52bd3fc63970e2929fc8df76a7621d5147a6e393 in geode's branch refs/heads/windows-heavy-lifter from balesh2
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=52bd3fc ]

            Commit 52bd3fc63970e2929fc8df76a7621d5147a6e393 in geode's branch refs/heads/feature/GEODE-5705 from balesh2
            [ https://gitbox.apache.org/repos/asf?p=geode.git;h=52bd3fc ]

            jira-bot ASF subversion and git services added a comment - Commit 52bd3fc63970e2929fc8df76a7621d5147a6e393 in geode's branch refs/heads/feature/ GEODE-5705 from balesh2 [ https://gitbox.apache.org/repos/asf?p=geode.git;h=52bd3fc ] GEODE-4650 : Refactor Elder selection (#2393) GEODE-4650 : Resolve race condition in selection of the elder no longer cache the elder, re-compute the elder when needed extract elder logic to a new class to make unit testing possible adds tests for elder selection adds tests of DLock Grantor failover removes isAdam() - isAdam used to mean that the member was alone (that there were no non-surprise, non-admin members in the cluster) when it joined. This was only used in two places. The first, in the DLockService, protected against recovering dlocks when there isn't a cluster. This usage is replaced with a check for isLoner(). The other use of isAdam was in ElderInitProcessor and was redundant with an inner check if there were other members in the distributed system. fix testFairness so that it can be run repeatedly in the same JVM Signed-off-by: Dan Smith <dsmith@pivotal.io> Signed-off-by: Galen O'Sullivan <gosullivan@pivotal.io> Signed-off-by: Ken Howe <khowe@pivotal.io>
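
For readers unfamiliar with the "no longer cache the elder, re-compute the elder when needed" item in the commit message above, here is a minimal sketch of the general idea. It is not the code from the commit. The names ElderSketch, Member, and computeElder are hypothetical, and the eligibility rule (oldest member that is neither an admin nor a surprise member) is an assumption made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the Geode code base.
final class ElderSketch {

  // Hypothetical stand-in for a cluster member in a membership view.
  static final class Member {
    final String name;
    final boolean admin;
    final boolean surprise;

    Member(String name, boolean admin, boolean surprise) {
      this.name = name;
      this.admin = admin;
      this.surprise = surprise;
    }
  }

  // Derives the elder from the current view on every call instead of
  // reading a cached field. The view is assumed to be ordered oldest first,
  // and the eligibility filter below is an assumption for illustration.
  static Optional<Member> computeElder(List<Member> viewOldestFirst) {
    return viewOldestFirst.stream()
        .filter(m -> !m.admin && !m.surprise)
        .findFirst();
  }

  public static void main(String[] args) {
    List<Member> view = new ArrayList<>();
    view.add(new Member("admin-1", true, false));
    view.add(new Member("server-1", false, false));
    view.add(new Member("server-2", false, false));

    System.out.println(computeElder(view).map(m -> m.name).orElse("none"));

    // When the current elder leaves the view, the next call reflects the
    // change immediately because nothing was cached.
    view.remove(1);
    System.out.println(computeElder(view).map(m -> m.name).orElse("none"));
  }
}
```

Because the result is derived from the view each time, there is no cached value that can go stale when the membership changes, which is one plausible reading of how the refactoring avoids the race.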

            ladyvader Lynn Hughes-Godfrey added a comment (edited)

            Note that this hang reproduced in CI:
            https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/UpgradeTestOpenJDK11/builds/707

            Since this was fixed in 1.8, perhaps we should expect to see it in rolling upgrade tests from older versions ... so just noting that it reproduced (without reopening).

            Hung Test:
            2019-05-10 22:41:28.511 +0000 org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterCurrentSiteMemberFailoverWithOldClient testSecondaryEventsNotReprocessedAfterCurrentSiteMemberFailoverWithOldClient[from_v100]

            Stack dump (from callstacks):

            "RMI TCP Connection(3)-172.17.0.4" #35 daemon prio=5 os_prio=0 cpu=5492.78ms elapsed=2867.65s tid=0x00007f23f8001800 nid=0x212 waiting on condition  [0x00007f244dab5000]
               java.lang.Thread.State: TIMED_WAITING (parking)
                    at jdk.internal.misc.Unsafe.park(java.base@11.0.2/Native Method)
                    - parking to wait for  <0x00000000e0804d68> (a java.util.concurrent.CountDownLatch$Sync)
                    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.2/LockSupport.java:234)
                    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.2/AbstractQueuedSynchronizer.java:1079)
                    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.2/AbstractQueuedSynchronizer.java:1369)
                    at java.util.concurrent.CountDownLatch.await(java.base@11.0.2/CountDownLatch.java:278)
                    at org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
                    at org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:736)
                    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:812)
                    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:789)
                    at org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:879)
                    at org.apache.geode.distributed.internal.locks.ElderInitProcessor.init(ElderInitProcessor.java:76)
                    at org.apache.geode.distributed.internal.locks.ElderState.<init>(ElderState.java:57)
                    at org.apache.geode.distributed.internal.DistributionManager.getElderStateWithTryLock(DistributionManager.java:3628)
                    at org.apache.geode.distributed.internal.DistributionManager.getElderState(DistributionManager.java:3574)
                    at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.startElderCall(GrantorRequestProcessor.java:254)
                    at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:377)
                    at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.basicOp(GrantorRequestProcessor.java:352)
                    at org.apache.geode.distributed.internal.locks.GrantorRequestProcessor.clearGrantor(GrantorRequestProcessor.java:340)
                    at org.apache.geode.distributed.internal.locks.DLockService.clearGrantor(DLockService.java:885)
                    at org.apache.geode.distributed.internal.locks.DLockGrantor.destroy(DLockGrantor.java:1274)
                    - locked <0x00000000e0b17d48> (a org.apache.geode.distributed.internal.locks.DLockGrantor)
                    at org.apache.geode.distributed.internal.locks.DLockService.nullLockGrantorId(DLockService.java:663)
                    at org.apache.geode.distributed.internal.locks.DLockService.basicDestroy(DLockService.java:2606)
                    at org.apache.geode.distributed.internal.locks.DLockService.destroyAndRemove(DLockService.java:2521)
                    - locked <0x00000000e0b17e78> (a java.lang.Object)
                    at org.apache.geode.distributed.internal.locks.DLockService.destroyServiceNamed(DLockService.java:2420)
                    at org.apache.geode.distributed.DistributedLockService.destroy(DistributedLockService.java:98)
                    at org.apache.geode.internal.cache.GemFireCacheImpl.destroyGatewaySenderLockService(GemFireCacheImpl.java:1943)
                    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2088)
                    - locked <0x00000000e0922ad8> (a java.lang.Class for org.apache.geode.internal.cache.GemFireCacheImpl)
                    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1862)
                    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1858)
                    at org.apache.geode.test.dunit.cache.internal.JUnit4CacheTestCase.closeCache(JUnit4CacheTestCase.java:327)
            

            Test report artifacts from this job are available at:

            http://files.apachegeode-ci.info/builds/apache-develop-main/1.10.0-SNAPSHOT.0269/test-artifacts/1557531731/upgradetestfiles-OpenJDK11-1.10.0-SNAPSHOT.0269.tgz
            

            agingade Anilkumar Gingade added a comment

            This has reproduced in:
            https://concourse.gemfire-ci.info/teams/main/pipelines/gemfire-develop-main/jobs/UpgradeTestOpenJDK11/builds/793

            org.apache.geode.cache.wan.WANRollingUpgradeVerifyGatewaySenderProfile > testVerifyGatewaySenderProfile[from_v190] FAILED
            org.apache.geode.test.dunit.RMIException: While invoking org.apache.geode.test.dunit.IgnoredException$1.run in VM 1 running on Host 3f5b4b164e0b with 4 VMs with version 180
            Caused by:
            java.lang.IllegalStateException: VM not available: VM 1 running on Host 3f5b4b164e0b with 4 VMs with version 180

            Test report artifacts from this job are available at:
            gs://gemfire-test-artifacts/builds/gemfire-develop-main/9.9.0-build.0255/test-artifacts/1563913236/upgradetestfiles-OpenJDK11-9.9.0-build.0255.tgz

            geodeintegration Geode Integration added a comment - Seen in UpgradeTestOpenJDK11 #80.

            People

              Assignee: Unassigned
              Reporter: Jason Huynh (jasonhuynh)
              Votes: 0
              Watchers: 6

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 6h 20m