[YARN-5214] Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 3.0.0-alpha1
Component/s: nodemanager
Labels: None
Target Version/s: 2.8.0
Hadoop Flags: Reviewed

Description

In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped, and after a while the RM marked the node as LOST. According to the log, the NM daemon was still running, but jstack shows that the NM's NodeStatusUpdater thread was blocked:
1. The Node Status Updater thread is blocked waiting for the lock 0x000000008065eae8:

"Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
        - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
        at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
        at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
        at java.lang.Thread.run(Thread.java:745)

2. The actual holder of this lock is the DiskHealthMonitor thread:

"DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
   java.lang.Thread.State: RUNNABLE
        at java.io.UnixFileSystem.createDirectory(Native Method)
        at java.io.File.mkdir(File.java:1316)
        at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
        at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
        at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
        at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
        - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
        at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
        at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
        at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
        at java.util.TimerThread.mainLoop(Timer.java:555)
        at java.util.TimerThread.run(Timer.java:505)

This disk operation can take longer than expected, especially under high IO throughput, and it runs while holding the DirectoryCollection monitor. We should use finer-grained locking for the related operations here.
The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should probably apply a similar fix here.
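
To illustrate the kind of finer-grained locking proposed above, here is a minimal sketch. It is not the attached patch and not the real DirectoryCollection: the class name DirHealthSketch, its fields, and the isHealthy predicate (a stand-in for the slow mkdir-based disk probe) are all hypothetical. The idea is to snapshot the directory lists under a read lock, run the slow check with no lock held, and take the write lock only briefly to publish the result, so that a reader such as getFailedDirs is not stuck behind disk IO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Predicate;

// Hypothetical sketch only; names do not come from the NodeManager code.
class DirHealthSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> goodDirs;
  private List<String> failedDirs = new ArrayList<>();

  DirHealthSketch(List<String> dirs) {
    this.goodDirs = new ArrayList<>(dirs);
  }

  // Readers hold the read lock only long enough to copy the list.
  List<String> getFailedDirs() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(failedDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  List<String> getGoodDirs() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(goodDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  // isHealthy stands in for the slow disk probe (an assumption of this sketch).
  void checkDirs(Predicate<String> isHealthy) {
    // Step 1: snapshot every known directory under the read lock.
    List<String> all;
    lock.readLock().lock();
    try {
      all = new ArrayList<>(goodDirs);
      all.addAll(failedDirs);
    } finally {
      lock.readLock().unlock();
    }

    // Step 2: run the potentially slow checks with no lock held.
    List<String> good = new ArrayList<>();
    List<String> failed = new ArrayList<>();
    for (String dir : all) {
      if (isHealthy.test(dir)) {
        good.add(dir);
      } else {
        failed.add(dir);
      }
    }

    // Step 3: publish the new state under a short-lived write lock.
    lock.writeLock().lock();
    try {
      goodDirs = good;
      failedDirs = failed;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

With this shape, a heartbeat thread calling getFailedDirs waits at most for the brief list swap in step 3, not for the disk probe in step 2. The trade-off is that readers may see results from the previous check while a new one is in progress.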

Attachments

- YARN-5214.patch, 14/Jun/16 16:25, 11 kB, Junping Du
- YARN-5214-v2.patch, 21/Jun/16 20:19, 12 kB, Junping Du
- YARN-5214-v3.patch, 22/Jun/16 23:53, 12 kB, Junping Du

People

Assignee: Junping Du

Reporter: Junping Du

Votes: 0

Watchers: 11

Dates

Created: 08/Jun/16 16:42

Updated: 25/Oct/19 20:27

Resolved: 06/Jul/16 00:17