Code-level summary of the attached patch:
(1) The NodeManager launches a DiskHealthCheckerService that periodically runs the disk-health-check code. A new configuration property yarn.nodemanager.disk-health-checker.interval-ms controls the frequency with which this check runs, with a default value of 120*1000 ms (i.e. 2 minutes).
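A minimal sketch of the periodic scheduling described above, assuming a plain java.util.Timer (the class and field names here are illustrative, not the patch's actual code):

```java
import java.util.Timer;
import java.util.TimerTask;

// Sketch of a service that re-runs a disk-health check at a fixed interval.
// The 2-minute default mirrors yarn.nodemanager.disk-health-checker.interval-ms.
public class DiskHealthCheckerSketch {
    public static final long DEFAULT_INTERVAL_MS = 120 * 1000L; // 2 minutes

    // daemon timer so it does not keep the NodeManager JVM alive on shutdown
    private final Timer timer = new Timer("DiskHealthChecker", true);

    public void start(long intervalMs) {
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override public void run() {
                checkDirs(); // re-validate every configured local/log directory
            }
        }, intervalMs, intervalMs);
    }

    void checkDirs() {
        // placeholder: the real check would test each dir for existence,
        // read/write permission, etc., and update the good-dirs lists
    }
}
```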
(2) LocalStorage is a new class that manages a list of local file-system directories and provides an API for checking the health of those directories (mostly similar to the TaskTracker.LocalStorage class of 0.20, except that this class's checkDirs() does not throw DiskErrorException when all directories fail; instead it returns true whenever a new disk failure is seen).
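The changed checkDirs() contract can be sketched as follows; this is a hedged illustration of the described semantics, not the patch's exact implementation (the simple isDirectory/canRead/canWrite probe stands in for the real disk check):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Illustrative LocalStorage: tracks which of the configured dirs are healthy.
// checkDirs() returns true iff a previously-good directory has newly failed,
// rather than throwing DiskErrorException when all directories fail.
public class LocalStorageSketch {
    private final List<String> allDirs;
    private final List<String> goodDirs = new ArrayList<>();

    public LocalStorageSketch(List<String> dirs) {
        this.allDirs = new ArrayList<>(dirs);
        this.goodDirs.addAll(dirs); // assume all dirs healthy at startup
    }

    /** Re-checks every directory; returns true iff a new failure was seen. */
    public synchronized boolean checkDirs() {
        List<String> stillGood = new ArrayList<>();
        for (String dir : allDirs) {
            File f = new File(dir);
            if (f.isDirectory() && f.canRead() && f.canWrite()) {
                stillGood.add(dir);
            }
        }
        // a new failure means some dir that was good is no longer good
        boolean newFailure = !stillGood.containsAll(goodDirs);
        goodDirs.clear();
        goodDirs.addAll(stillGood);
        return newFailure;
    }

    public synchronized List<String> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }
}
```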
(3) The DiskHealthCheckerService maintains two LocalStorage objects: one for nm-local-dirs and one for nm-log-dirs.
(4) ContainerExecutor is initialized with the DiskHealthCheckerService object, so both DefaultContainerExecutor.java and LinuxContainerExecutor.java always obtain the current good nm-local-dirs and nm-log-dirs from the DiskHealthChecker.
(5) The container-executor binary receives the good nm-local-dirs and good nm-log-dirs as parameters and uses only those directories. They are therefore removed from the configuration file (i.e. they no longer need to be configured in container-executor.cfg).
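The wiring in (4) and (5) can be sketched like this; the interface and method names below are hypothetical stand-ins for the patch's actual API, showing only the shape of the interaction (the executor queries the health checker for the current good dirs at each launch instead of reading static configuration):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative wiring: a container executor that asks the disk health checker
// for the currently-healthy dirs each time it builds a container launch.
public class ExecutorSketch {
    // hypothetical view of what the DiskHealthCheckerService exposes
    interface HealthChecker {
        List<String> getGoodLocalDirs();
        List<String> getGoodLogDirs();
    }

    private final HealthChecker checker;

    ExecutorSketch(HealthChecker checker) {
        this.checker = checker;
    }

    /** Builds the dir arguments passed to the container-executor binary. */
    String buildLaunchArgs() {
        // only healthy dirs are handed to the launched container
        return String.join(",", checker.getGoodLocalDirs())
             + " " + String.join(",", checker.getGoodLogDirs());
    }
}
```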
(6) Whenever a new container is launched, the good nm-local-dirs and good nm-log-dirs are updated in the configuration so that containers do not access bad disks. All consumers (the localizer, the web server) go through the DiskHealthChecker to access nm-local-dirs and nm-log-dirs.
(7) On the NodeManager web UI, the NodeHealthReport now shows the list of good nm-local-dirs and nm-log-dirs in addition to the true/false node-health status.
(8) A new unit test, TestDiskFailures, is added. It simulates failures of disks (both nm-local-dirs and nm-log-dirs) and verifies that the NodeManager/DiskHealthChecker identifies these disk failures.
Tested the patch with (1) DefaultContainerExecutor and (2) LinuxContainerExecutor on my single-node cluster. The functionality works as expected: disk failures are identified by the NodeManager, and the bad nm-local-dirs and nm-log-dirs are avoided for new containers.