Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-5214

Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 2.8.0, 3.0.0-alpha1
    • nodemanager
    • None
    • Reviewed

    Description

      In one cluster, we notice NM's heartbeat to RM is suddenly stopped and wait a while and marked LOST by RM. From the log, the NM daemon is still running, but jstack hints NM's NodeStatusUpdater thread get blocked:
      1. Node Status Updater thread get blocked by 0x000000008065eae8

      "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
              - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
              at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
              at java.lang.Thread.run(Thread.java:745)
      

      2. The actual holder of this lock is DiskHealthMonitor:

      "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
         java.lang.Thread.State: RUNNABLE
              at java.io.UnixFileSystem.createDirectory(Native Method)
              at java.io.File.mkdir(File.java:1316)
              at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
              at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
              - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
              at java.util.TimerThread.mainLoop(Timer.java:555)
              at java.util.TimerThread.run(Timer.java:505)
      

      This disk operation could take longer time than expectation especially in high IO throughput case and we should have fine-grained lock for related operations here.
      The same issue on HDFS get raised and fixed in HDFS-7489, and we probably should have similar fix here.

      Attachments

        1. YARN-5214.patch
          11 kB
          Junping Du
        2. YARN-5214-v2.patch
          12 kB
          Junping Du
        3. YARN-5214-v3.patch
          12 kB
          Junping Du

        Activity

          People

            junping_du Junping Du
            junping_du Junping Du
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: