Geode / GEODE-10330

Resource issues lead to "MemberDisconnectedException: Member isn't responding to heartbeat requests"


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • 1.16.0
    • 1.16.0
    • None

    Description

      A failure was observed in DistributedMulticastRegionWithUDPSecurityDUnitTest > testMulticastAfterReconnect, caused by a suspect string: fatal-level logging of "Membership service failure: Member isn't responding to heartbeat requests".

      The logs showed every member reporting long statistics sampling wakeup delays, which indicates resource issues:

      [vm3] [warn 2022/05/21 07:28:16.251 UTC LocatorWithMcast <StatSampler> tid=0xb8] Statistics sampling thread detected a wakeup delay of 4760 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
      
      ...
      
      [locator] [warn 2022/05/21 07:28:20.288 UTC  <StatSampler> tid=0x3b] Statistics sampling thread detected a wakeup delay of 12400 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
      
      ...
      
      [vm1] [warn 2022/05/21 07:28:20.969 UTC vm1 <StatSampler> tid=0xda] Statistics sampling thread detected a wakeup delay of 13738 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
      
      ...
      
      [vm0] [warn 2022/05/21 07:28:22.226 UTC vm0 <StatSampler> tid=0xa9] Statistics sampling thread detected a wakeup delay of 15110 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics. 
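
      Warnings like these can be pulled out of a combined log mechanically. The following is a minimal Python sketch, assuming the line layout shown in the excerpts above; the function names parse_wakeup_delays and worst_delay_per_member are illustrative and are not part of Geode or its dev-tools.

```python
import re

# Matches the StatSampler warning format shown in the excerpts above.
# Assumption: each line starts with "[member] [warn <date> <time> UTC ...".
WAKEUP_DELAY = re.compile(
    r"^\s*\[(?P<member>[^\]]+)\] \[warn (?P<timestamp>\S+ \S+) UTC.*?"
    r"wakeup delay of (?P<delay_ms>\d+) ms"
)


def parse_wakeup_delays(lines):
    """Return (member, timestamp, delay_ms) for each matching log line."""
    results = []
    for line in lines:
        match = WAKEUP_DELAY.search(line)
        if match:
            results.append(
                (match["member"], match["timestamp"], int(match["delay_ms"]))
            )
    return results


def worst_delay_per_member(lines):
    """Map each member to the largest wakeup delay (ms) it reported."""
    worst = {}
    for member, _, delay_ms in parse_wakeup_delays(lines):
        worst[member] = max(delay_ms, worst.get(member, 0))
    return worst
```

      Feeding the lines of a test run's log to worst_delay_per_member gives a quick per-member summary of how severe the stall was.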

       

      After downloading the test artifacts and running the progress tool from the dev-tools directory of the Geode repository, the following tests were found to be running while the resource issues occurred, which may indicate that one or more of them is particularly resource-intensive:

      $> progress -r '2022-05-21 07:28:16.251 -0000' | grep org | sort
      org.apache.geode.cache.PRCacheListenerWithInterestPolicyAllDistributedTest.afterUpdateIsInvokedInEveryMember[0: redundancy=0]
      org.apache.geode.cache.lucene.LuceneQueriesReindexDUnitTest.recreateIndexWithDifferentFieldsShouldFail(PARTITION_OVERFLOW_TO_DISK) [2]
      org.apache.geode.cache.query.cq.dunit.CqDataUsingPoolOptimizedExecuteDUnitTest.testCQHAWithState
      org.apache.geode.cache.query.cq.dunit.PartitionedRegionCqQueryDUnitTest.testPartitionedCqOnAccessorBridgeServer
      org.apache.geode.cache30.CallbackArgDUnitTest.testForCA
      org.apache.geode.cache30.DistributedMulticastRegionWithUDPSecurityDUnitTest.testMulticastAfterReconnect
      org.apache.geode.cache30.DistributedNoAckRegionCCEOffHeapDUnitTest.testDistributedInvalidate
      org.apache.geode.cache30.GlobalRegionOffHeapDUnitTest.testOrderedUpdates
      org.apache.geode.cache30.ReconnectWithClusterConfigurationDUnitTest.testReconnectAfterMeltdown
      org.apache.geode.distributed.internal.P2PMessagingConcurrencyDUnitTest.testP2PMessaging(true, false, 32768, 65536) [6]
      org.apache.geode.disttx.PRDistTXDUnitTest.testSimulaneousChildRegionCreation
      org.apache.geode.internal.cache.ClientServerTransactionCCEDUnitTest.testClientCommitFunctionWithFailure
      org.apache.geode.internal.cache.eviction.OffHeapEvictionStatsDUnitTest.testHeapLruCounter
      org.apache.geode.internal.cache.wan.concurrent.ConcurrentParallelGatewaySenderOperation_1_DUnitTest.testParallelPropagationSenderStartAfterStopOnAccessorNode
      org.apache.geode.internal.cache.wan.offheap.ParallelGatewaySenderOperationsOffHeapDistributedTest.testParallelGatewaySenderStartOnAccessorNode
      org.apache.geode.internal.cache.wan.serial.SerialWANPropagation_PartitionedRegionDUnitTest.testPartitionedSerialPropagationHA
      org.apache.geode.internal.tcp.TCPConduitDUnitTest.basicAcceptConnection[0]
      org.apache.geode.management.internal.configuration.ClusterConfigImportDUnitTest.importFailWithExistingRegion
      org.apache.geode.rest.internal.web.controllers.RestAPIsOnGroupsFunctionExecutionDUnitTest.testBasicP2PFunctionSelectedGroup[1]
      org.apache.geode.session.tests.Jetty9CachingClientServerTest.failureShouldStillAllowOtherContainersDataAccess
      org.apache.geode.session.tests.Tomcat8ClientServerCustomCacheXmlTest.containersShouldExpireInSetTimeframe
      org.apache.geode.session.tests.Tomcat8Test.containersShouldReplicateCookies
      org.apache.geode.session.tests.Tomcat9ClientServerTest.invalidationShouldRemoveValueAccessForAllContainers
      

      Future failures due to this sort of resource issue should also list concurrently running tests so that repeat appearances by individual tests can be used to identify the culprits.
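
      The cross-referencing suggested here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each failure is assumed to have been recorded as a list of concurrently running test names keyed by ticket, and the function name repeat_offenders and the min_count threshold are hypothetical.

```python
from collections import Counter


def repeat_offenders(failures, min_count=2):
    """Rank tests by how many failure reports they appear in.

    failures: dict mapping a ticket id to the list of tests that were
    running when the resource issue occurred (hypothetical structure).
    Returns (test_name, count) pairs for tests seen in at least
    min_count reports, most frequent first, ties broken by name.
    """
    counts = Counter()
    for tests in failures.values():
        # A test counts once per report, even if listed more than once.
        counts.update(set(tests))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(name, count) for name, count in ranked if count >= min_count]
```

      A test that shows up in several such reports is a stronger candidate for being resource-intensive than one that appears only once.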

            People

              nnag Nabarun Nag
              donalevans Donal Evans