Geode / GEODE-8809

Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: messaging
    • Labels: None


      We see this characteristic failure in a number of proprietary applications:

      • A member stops sending heartbeats.
      • The coordinator requests an availability check from the member.
      • The member receives the request after a delay.
      • The delay causes the server to be kicked out (it receives a ForcedDisconnectException).
      • Operations fail.
      • The server reconnects.
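
      The sequence above can be sketched as a small decision rule. This is only an illustration of the timeline, not Geode's membership code: the class name, the Verdict values, and the timeout parameters are all hypothetical, and all times are assumed to be milliseconds on the coordinator's clock.

```java
public class AvailabilityCheckSketch {

    /** Hypothetical outcomes; these names are not taken from Geode. */
    enum Verdict { HEALTHY, SUSPECT_CLEARED, REMOVED }

    /**
     * Hypothetical rule mirroring the failure sequence in this ticket.
     *
     * @param lastHeartbeatMs      when the coordinator last saw a heartbeat
     * @param nowMs                current time on the coordinator
     * @param heartbeatTimeoutMs   assumed allowance for heartbeat silence
     * @param checkResponseDelayMs how long the availability check took to be answered
     * @param checkTimeoutMs       assumed allowance for answering the check
     */
    static Verdict evaluate(long lastHeartbeatMs, long nowMs, long heartbeatTimeoutMs,
                            long checkResponseDelayMs, long checkTimeoutMs) {
        // Heartbeats are still arriving on time: nothing to do.
        if (nowMs - lastHeartbeatMs <= heartbeatTimeoutMs) {
            return Verdict.HEALTHY;
        }
        // Heartbeats stopped, but the member answered the check in time.
        if (checkResponseDelayMs <= checkTimeoutMs) {
            return Verdict.SUSPECT_CLEARED;
        }
        // The answer arrived too late: the member is kicked out.
        return Verdict.REMOVED;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 12 s of heartbeat silence, check answered after 8 s.
        System.out.println(evaluate(0, 12_000, 5_000, 8_000, 5_000));
    }
}
```

      With these illustrative numbers the member is removed even though it did eventually answer, which is the situation this ticket describes: the answer was late, not absent.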

      Usually, when the failure detector/health monitor kicks a member out of the distributed system, it is for one of these reasons:

      1. Member really was malfunctioning or unreachable (i.e. something outside of health monitoring had a problem)

        a. Network problems

          i. Partition: 2-way, N-way

          ii. Slowdown or error rate increase

        b. CPU was over-taxed on the faulty member. We see gaps on the order of 10 seconds or more in heartbeat generation on that member.

          i. Geode was running in a virtualized environment and the virtualization system didn’t give the Geode process sufficient CPU

          ii. JVM memory was over-utilized, so garbage-collection pauses took too long

          iii. There was simply too much CPU demand and the product failed to reserve enough CPU capacity to keep the heartbeat going
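
      When looking for evidence of cause 1.b, two simple measurements may help: the gaps between consecutive heartbeat timestamps on the suspect member, and the cumulative garbage-collection time the JVM reports. The following is a minimal sketch under stated assumptions. The class and method names are hypothetical, the heartbeat timestamps are assumed to have been extracted already (for example from logs or statistics), and the 10-second threshold is only the order of magnitude mentioned above. The GC figure comes from the standard java.lang.management API and is accumulated collection time, which is not necessarily identical to stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class HeartbeatGapSketch {

    /**
     * Returns one {startMs, lengthMs} pair for every interval between
     * consecutive heartbeats that is longer than thresholdMs.
     * Timestamps are assumed to be sorted in ascending order.
     */
    static List<long[]> findGaps(long[] heartbeatTimesMs, long thresholdMs) {
        List<long[]> gaps = new ArrayList<>();
        for (int i = 1; i < heartbeatTimesMs.length; i++) {
            long length = heartbeatTimesMs[i] - heartbeatTimesMs[i - 1];
            if (length > thresholdMs) {
                gaps.add(new long[] {heartbeatTimesMs[i - 1], length});
            }
        }
        return gaps;
    }

    /**
     * Sums the accumulated collection time (ms) over all collectors in this JVM.
     * Collectors that report -1 (undefined) are skipped.
     */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up timestamps for illustration, not data from a real incident.
        long[] times = {0, 5_000, 10_000, 22_000, 27_000};
        for (long[] gap : findGaps(times, 10_000)) {
            System.out.println("gap of " + gap[1] + " ms starting at " + gap[0] + " ms");
        }
        System.out.println("cumulative GC time: " + totalGcMillis() + " ms");
    }
}
```

      Sampling totalGcMillis() before and after a heartbeat gap gives a rough indication of whether garbage collection can account for it. If it cannot, and there is no sign of network trouble or CPU starvation either, the incident falls into the unexplained category.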

      This ticket captures situations where the failure detector causes a member to be kicked out but we cannot definitively prove that any of these is the root cause.




            burcham Bill Burcham
            nnag Nabarun Nag