[HDFS-9305] Delayed heartbeat processing causes storm of subsequent heartbeats - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.1
Fix Version/s: 2.8.0, 2.7.2, 3.0.0-alpha1
Component/s: datanode
Labels: None
Hadoop Flags: Reviewed

Description

A DataNode typically sends a heartbeat to the NameNode every 3 seconds, and we expect heartbeat handling to complete quickly. However, if something unexpected blocks heartbeat processing, such as a long GC pause or heavy lock contention within the NameNode, heartbeats are delayed. After recovering from this delay, the DataNode sends a storm of heartbeat messages in a tight loop. In a large cluster with many DataNodes, this storm could put harmful load on the NameNode and make overall cluster recovery more difficult.

The bug appears to be caused by incorrect timekeeping inside BPServiceActor. The next heartbeat time is always calculated as a delta from the previous heartbeat time, without any compensation for possible long latency on an individual heartbeat RPC. The only mitigations would be to restart all DataNodes, forcing a reset of the heartbeat schedule, or simply to wait out the storm until the scheduling catches up and corrects itself.
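
To make the timekeeping problem concrete, here is a minimal simulation of the two scheduling policies. It is only a sketch, not the BPServiceActor source: the class name, method name, and the simulated clock are hypothetical, and it assumes that heartbeats sent while catching up complete instantly. Rescheduling from the current time is shown as one possible form of compensation, not necessarily what the attached patches do.

```java
// Hypothetical sketch for illustration; not taken from BPServiceActor.
class HeartbeatScheduleSketch {

    /**
     * Simulates one heartbeat RPC that blocks for stallMs, then counts how many
     * further heartbeats are already due the moment the DataNode resumes.
     * Simplification (assumption): catch-up heartbeats take zero time.
     */
    static int backToBackHeartbeats(long stallMs, long intervalMs, boolean scheduleFromNow) {
        long now = 0;        // simulated clock, in milliseconds
        long scheduled = 0;  // time at which the stalled heartbeat was due

        now += stallMs;      // the heartbeat RPC returns only after the stall

        // Policy A: delta from the previous scheduled time (behavior described above).
        // Policy B: delta from the current clock reading.
        long next = scheduleFromNow ? now + intervalMs : scheduled + intervalMs;

        int sentImmediately = 0;
        while (next <= now) {  // heartbeat is already overdue, so it is sent right away
            sentImmediately++;
            next = scheduleFromNow ? now + intervalMs : next + intervalMs;
        }
        return sentImmediately;
    }

    public static void main(String[] args) {
        // A 30-second stall with the typical 3-second interval.
        System.out.println(backToBackHeartbeats(30_000, 3_000, false)); // prints 10
        System.out.println(backToBackHeartbeats(30_000, 3_000, true));  // prints 0
    }
}
```

Under the first policy the schedule stays anchored in the past, so every missed interval turns into an immediate heartbeat. Multiplied across many DataNodes, that is the storm described above.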

This problem would not manifest after a NameNode restart. In that case, the NameNode would respond to the first heartbeat by telling the DataNode to re-register, and BPServiceActor#reRegister would reset the heartbeat schedule to the current time. I believe the problem would only manifest if the NameNode process stayed alive but processed heartbeats unexpectedly slowly.

Attachments

- HDFS-9305.01.patch (3 kB), 26/Oct/15 17:22, Arpit Agarwal
- HDFS-9305.02.patch (3 kB), 26/Oct/15 19:01, Arpit Agarwal

People

Assignee: Arpit Agarwal
Reporter: Chris Nauroth
Votes: 0
Watchers: 17

Dates

Created: 26/Oct/15 04:42
Updated: 06/Jan/17 07:32
Resolved: 26/Oct/15 22:55