The disk health checker verifies a disk by executing mkdir and rmdir periodically.
If these operations do not return within a reasonable timeout, the disk should be marked bad, and thus nodeInfo.nodeHealthy should flip to false.
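To make the expected behavior concrete, here is a minimal sketch of a mkdir/rmdir probe bounded by a timeout. It is not YARN code: the class TimedDiskProbe, the method probe() and the timeoutMs parameter are hypothetical names used only for illustration.

```java
import java.io.File;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of YARN: runs mkdir + rmdir with an upper time bound.
final class TimedDiskProbe {
    private TimedDiskProbe() {}

    /** Returns true only if mkdir and rmdir both succeed within timeoutMs. */
    static boolean probe(File dir, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "disk-probe");
                t.setDaemon(true); // a hung probe must not keep the JVM alive
                return t;
            }
        });
        final File probeDir = new File(dir, "probe-" + System.nanoTime());
        Future<Boolean> result = executor.submit(new Callable<Boolean>() {
            @Override
            public Boolean call() {
                // File.delete() removes an empty directory, i.e. it acts as rmdir here.
                return probeDir.mkdir() && probeDir.delete();
            }
        });
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // the file system call did not return in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false; // the probe itself threw, treat the disk as bad
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Calling TimedDiskProbe.probe(new File("/some/local/dir"), 5000) would then report the directory as bad after five seconds of no response. Note that a thread blocked in a file system call may not react to the interrupt sent by shutdownNow(), so the sketch only bounds how long the caller waits; it does not reclaim the stuck thread.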
I confirmed with Earthquake, our fault injector for distributed systems, that the current YARN does not have an implicit timeout (tested on JDK7, Linux 4.2, ext4).
(I will post the reproduction script shortly.)
I think we can fix this issue by making NodeHealthCheckerServer.isHealthy() return false if the value of this.getLastHealthReportTime() is too old.
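The proposed check could look roughly like the sketch below. It is only an illustration of the idea, not a patch: the class StaleReportCheck, its isHealthy() signature and the maxAgeMs threshold are hypothetical, and how the threshold would be chosen or configured is left open.

```java
// Hypothetical sketch, not YARN code: a health report that is too old counts as unhealthy.
final class StaleReportCheck {
    private StaleReportCheck() {}

    /**
     * @param reportedHealthy  result of the last completed health check
     * @param lastReportTimeMs time of the last health report, in milliseconds
     * @param nowMs            current time, in milliseconds
     * @param maxAgeMs         assumed upper bound on the age of a valid report
     */
    static boolean isHealthy(boolean reportedHealthy, long lastReportTimeMs,
                             long nowMs, long maxAgeMs) {
        // If the checker thread is stuck in mkdir or rmdir, the report stops being refreshed.
        boolean stale = nowMs - lastReportTimeMs > maxAgeMs;
        return reportedHealthy && !stale;
    }
}
```

With this approach the checker thread can stay blocked indefinitely, but the node is still reported as unhealthy once the last report is older than the threshold.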