Thanks Wangda Tan for the review and comments!
Even after the R/W lock changes, when anything bad happens on the disks, DirectoryCollection will be stuck under write locks, so NodeStatusUpdater will be blocked as well.
Not really. From the jstack above, you can see that the operation pending on busy IO happens in the line below, which is now outside any lock:
Map<String, DiskErrorInformation> dirsFailedCheck = testDirs(allLocalDirs,
So NodeStatusUpdater won't get blocked while testDirs is pending on an mkdir operation.
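To illustrate the pattern, here is a minimal sketch (class and method names DirChecker/probe are hypothetical, not the actual DirectoryCollection code): the slow disk probe runs with no lock held, and only the quick in-memory state update takes the write lock, so readers like NodeStatusUpdater never wait on busy IO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the checkDirs pattern discussed above: the possibly
// blocking IO test runs outside any lock; the lock only guards the fast
// in-memory swap of results.
class DirChecker {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> goodDirs = new ArrayList<>();

  void checkDirs(List<String> allDirs) {
    // Slow, possibly blocking IO happens with no lock held.
    List<String> passed = new ArrayList<>();
    for (String d : allDirs) {
      if (probe(d)) {   // e.g. mkdir/read/write test; may stall on a bad disk
        passed.add(d);
      }
    }
    // Only the quick in-memory update takes the write lock.
    lock.writeLock().lock();
    try {
      goodDirs = passed;
    } finally {
      lock.writeLock().unlock();
    }
  }

  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(goodDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Placeholder probe; the real check does mkdir and disk-space validation.
  private boolean probe(String dir) {
    return !dir.startsWith("bad");
  }
}
```

Even if probe() hangs for minutes on a failing disk, a concurrent getGoodDirs() call still returns immediately with the last known state.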
1) In the short term, errorDirs/fullDirs/localDirs are copy-on-write lists, so we don't need to acquire a lock in getGoodDirs/getFailedDirs/getFullDirs. This could lead to inconsistent data in rare cases, but I think in general this is safe, and inconsistent data will be corrected in the next heartbeat.
In general, a read/write lock is more flexible and more consistent, since we have several resources under race conditions. A copy-on-write list can only guarantee that no ConcurrentModificationException happens between a read and a write on the same list; it cannot provide consistent semantics across lists. Thus, I would prefer to keep the read/write lock here, and CopyOnWriteArrayList can be replaced with a plain ArrayList. What do you think?
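A minimal sketch of the cross-list consistency point (DirState and markFailed are hypothetical names, not the actual DirectoryCollection API): moving a dir between two related lists under one write lock keeps the pair consistent for every reader, which separate CopyOnWriteArrayLists cannot guarantee because a reader could observe one list updated and the other not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: one read/write lock guards two related lists, so a
// multi-list update (remove from goodDirs + add to failedDirs) is atomic with
// respect to readers. Invariant: every dir is in exactly one list.
class DirState {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> goodDirs = new ArrayList<>();
  private final List<String> failedDirs = new ArrayList<>();

  DirState(List<String> dirs) {
    goodDirs.addAll(dirs);
  }

  // Move a dir from good to failed as one atomic step under the write lock.
  void markFailed(String dir) {
    lock.writeLock().lock();
    try {
      if (goodDirs.remove(dir)) {
        failedDirs.add(dir);
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Readers observe both lists at a single consistent point in time, so the
  // total count never fluctuates mid-update.
  int totalDirs() {
    lock.readLock().lock();
    try {
      return goodDirs.size() + failedDirs.size();
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

With two independent CopyOnWriteArrayLists instead, totalDirs() could transiently report one dir too few or too many while a writer is between the two list operations.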
2) In the longer term, we may need to consider a DirectoryCollection stuck on busy IO an unhealthy state; NodeStatusUpdater should be able to report such status to the RM, so the RM will avoid allocating any new containers to such nodes.
I agree we should provide better IO control on each node of a YARN cluster. We can report an unhealthy status when IO gets stuck, or even better, count IO load as a resource for better/smarter scheduling. However, how to react to the very-busy-IO case is a different topic from the problem this JIRA tries to resolve. In any case, the NM heartbeat is not supposed to be cut off unless the daemon crashes.