[YARN-4301] NM disk health checker should have a timeout - ASF JIRA

Details

Type: Sub-task
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: nodemanager
Labels:
- oct16-medium

Description

The disk health checker verifies a disk by executing mkdir and rmdir periodically.
If these operations do not return within a reasonable timeout, the disk should be marked bad, and thus nodeInfo.nodeHealthy should flip to false.
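
For illustration, a timeout around the mkdir/rmdir probe could look roughly like the sketch below. This is not existing YARN code: the class TimedDiskCheck, its methods runWithTimeout and mkdirRmdir, and the timeout parameter are hypothetical names chosen for this example. It runs the probe on a separate daemon thread and treats "no result within the timeout" the same as a failed check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of YARN.
class TimedDiskCheck {

    // Runs the probe on a daemon thread and returns false if it fails
    // or does not finish within timeoutMs milliseconds.
    static boolean runWithTimeout(Callable<Boolean> probe, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-check");
            t.setDaemon(true); // a stuck probe must not keep the JVM alive
            return t;
        });
        Future<Boolean> result = executor.submit(probe);
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // best effort; blocked I/O may ignore the interrupt
            return false;
        } catch (ExecutionException e) {
            return false;        // the probe itself threw, e.g. an IOException
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    // The probe: create a directory under the checked path, then remove it.
    static boolean mkdirRmdir(Path parent) throws IOException {
        Path probe = Files.createTempDirectory(parent, "health-check-");
        Files.delete(probe);
        return true;
    }
}
```

A caller would use something like TimedDiskCheck.runWithTimeout(() -> TimedDiskCheck.mkdirRmdir(dir), 5000). Note that a thread blocked in an uninterruptible filesystem call stays blocked after the timeout; the sketch only guarantees that the caller gets an answer in bounded time.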

Using Earthquake, our fault injector for distributed systems, I confirmed that current YARN has no implicit timeout for these operations (on JDK7, Linux 4.2, ext4).
(I'll post the reproduction script shortly.)

I think we can fix this issue by making NodeHealthCheckerServer.isHealthy() return false if the value of this.getLastHealthReportTime() is too old.
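
A minimal sketch of that staleness rule, written as a standalone function so it does not depend on the actual YARN classes. The class name HealthReportStaleness, the parameter names, and the maxAgeMs threshold are assumptions for illustration; in a real patch the threshold would presumably come from configuration.

```java
// Hypothetical standalone sketch; not the actual YARN implementation.
class HealthReportStaleness {

    // Healthy only if the last check passed AND its report is recent enough.
    // lastReportTimeMs plays the role of getLastHealthReportTime();
    // maxAgeMs is an assumed threshold.
    static boolean isHealthy(boolean lastCheckPassed,
                             long lastReportTimeMs,
                             long nowMs,
                             long maxAgeMs) {
        boolean fresh = (nowMs - lastReportTimeMs) <= maxAgeMs;
        return lastCheckPassed && fresh;
    }
}
```

With this rule, a checker thread that hangs inside mkdir or rmdir stops updating the report time, so the node turns unhealthy once the last report is older than the threshold, even though the last completed check passed.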