[HDFS-9901] Move disk IO out of the heartbeat thread - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: datanode
Labels: None

Description

During heavy disk IO, we noticed that the heartbeat thread hangs in the checkBlock method, which checks the existence and length of a block before spinning off a thread to do the actual transfer. In extreme cases, the heartbeat thread hung for more than 10 minutes, so the namenode marked the datanode as dead and started replicating its blocks, which caused more disk IO on other nodes and could potentially bring them down as well.

The patch contains two changes:
1. Makes DF asynchronous when monitoring the disk by creating a thread that checks the disk and updates the disk status periodically. When the heartbeat thread generates the storage report, it reads disk usage information from memory, so it won't get blocked during heavy disk IO.
2. Moves the checks (which require disk access) in transferBlock() in DataNode into a separate thread, so the heartbeat thread does not have to wait for them while heartbeating.
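
The two changes above can be illustrated with a minimal, standalone sketch. This is not the code from the patch: the class names (HeartbeatOffloadSketch, CachedDiskUsage, BlockCheckScheduler), the refresh period, and the pool size are all hypothetical, and File.getUsableSpace() merely stands in for whatever DF actually measures. It only shows the general pattern of keeping blocking disk calls off the heartbeat thread.

```java
import java.io.File;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; these are not the classes touched by the patch. */
public class HeartbeatOffloadSketch {

  /** Change 1: a background thread refreshes disk usage; readers only touch memory. */
  static final class CachedDiskUsage implements AutoCloseable {
    private final File volume;
    private volatile long availableBytes;
    private final ScheduledExecutorService refresher =
        Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "disk-usage-refresher");
          t.setDaemon(true);
          return t;
        });

    CachedDiskUsage(File volume, long periodMillis) {
      this.volume = volume;
      refresh(); // one blocking read at startup so the cache is never empty
      refresher.scheduleWithFixedDelay(
          this::refresh, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** The only place that touches the disk; may block under heavy IO. */
    private void refresh() {
      availableBytes = volume.getUsableSpace();
    }

    /** Called from the heartbeat thread: a volatile read, no disk access. */
    long getAvailable() {
      return availableBytes;
    }

    @Override
    public void close() {
      refresher.shutdownNow();
    }
  }

  /** Change 2: existence/length checks run on a worker pool, not the heartbeat thread. */
  static final class BlockCheckScheduler implements AutoCloseable {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);

    /** Returns immediately; the disk-touching check completes asynchronously. */
    CompletableFuture<Boolean> submitCheck(File blockFile, long expectedLength) {
      return CompletableFuture.supplyAsync(
          () -> blockFile.exists() && blockFile.length() == expectedLength, pool);
    }

    @Override
    public void close() {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    File volume = new File(System.getProperty("java.io.tmpdir"));
    File block = File.createTempFile("blk_", ".data");
    block.deleteOnExit();

    try (CachedDiskUsage usage = new CachedDiskUsage(volume, 1000L);
         BlockCheckScheduler checks = new BlockCheckScheduler()) {
      // Simulated heartbeat: builds its report from memory and hands off the check.
      System.out.println("available bytes (cached): " + usage.getAvailable());
      CompletableFuture<Boolean> valid = checks.submitCheck(block, 0L);
      System.out.println("block check passed: " + valid.join());
    }
  }
}
```

In this sketch the trade-off is staleness: the heartbeat reports a disk usage value that can be up to one refresh period old, in exchange for never waiting on the disk itself.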