[GEODE-8809] Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: messaging
Labels: None

Description

We see this characteristic failure in a number of proprietary applications:

The member stops sending heartbeats.
The coordinator requests an availability test from the member.
The member receives the request after a delay.
The delay causes the server to be kicked out (it receives a ForcedDisconnectException).
Operations fail.
The server reconnects.

Usually when the failure detector/health monitor kicks a member out of the distributed system it is for one of these reasons:

1. Member really was malfunctioning or unreachable (i.e. something outside of health monitoring had a problem)

a. Network problems

i. Partition: 2-way, N-way

ii. Slowdown or error rate increase

b. CPU was over-taxed on the faulty member. We see gaps on the order of 10s or more in heartbeat generation on that member.

i. Geode was running in a virtualized environment and the virtualization system didn’t give the Geode process sufficient CPU

ii. JVM memory was over-utilized, so garbage collection pauses took too long

iii. There was simply too much CPU demand and the product failed to reserve enough CPU capacity to keep the heartbeat going

This ticket captures situations where the failure detector causes a member to be kicked out but we cannot definitively prove that any of these is the root cause.
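
One way to look for evidence of cause 1.b is to scan the times at which a member sent heartbeats and flag any interval between consecutive heartbeats that exceeds a threshold. The following is a minimal sketch, not Geode code: the function name `find_heartbeat_gaps`, the sample timestamps, and the 10-second threshold (reading "10s" above as ten seconds) are all assumptions made for illustration, and it presumes the send times have already been extracted from logs or statistics.

```python
from datetime import datetime, timedelta


def find_heartbeat_gaps(timestamps, threshold=timedelta(seconds=10)):
    """Return (earlier, later, gap) for each pair of consecutive
    heartbeats that are at least `threshold` apart.

    `timestamps` is any iterable of datetime objects; it is sorted
    first so out-of-order input does not produce negative gaps.
    """
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later - earlier
        if delta >= threshold:
            gaps.append((earlier, later, delta))
    return gaps


# Hypothetical heartbeat send times (seconds after an arbitrary start).
base = datetime(2021, 1, 6, 18, 0, 0)
heartbeats = [base + timedelta(seconds=s) for s in (0, 5, 10, 27, 32)]

for earlier, later, delta in find_heartbeat_gaps(heartbeats):
    print(f"gap of {delta.total_seconds():.0f}s "
          f"between {earlier.time()} and {later.time()}")
```

A gap found this way only shows that heartbeat generation stalled on the member. Attributing it to virtualization, garbage collection, or CPU demand still requires correlating it with other data such as GC logs or host CPU statistics.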

People

Assignee: Bill Burcham

Reporter: Nabarun Nag

Votes: 0

Watchers: 2

Dates

Created: 06/Jan/21 18:01

Updated: 05/Apr/21 21:57

Resolved: 08/Mar/21 19:16