Hadoop YARN: YARN-5214

Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: nodemanager
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped; after a while the node was marked LOST by the RM. From the logs, the NM daemon was still running, but jstack shows that the NM's NodeStatusUpdater thread is blocked:
      1. The Node Status Updater thread is blocked waiting on 0x000000008065eae8:

      "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
              - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
              at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
              at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
              at java.lang.Thread.run(Thread.java:745)
      

      2. The actual holder of this lock is DiskHealthMonitor:

      "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
         java.lang.Thread.State: RUNNABLE
              at java.io.UnixFileSystem.createDirectory(Native Method)
              at java.io.File.mkdir(File.java:1316)
              at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
              at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
              at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
              - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
              at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
              at java.util.TimerThread.mainLoop(Timer.java:555)
              at java.util.TimerThread.run(Timer.java:505)
      

      This disk operation can take much longer than expected, especially under high IO throughput, so we should use fine-grained locking for the related operations here.
      The same issue was raised and fixed on HDFS in HDFS-7489, and we should probably make a similar fix here.
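      A minimal sketch of the fine-grained locking idea (hypothetical, simplified class and method names; not the actual YARN-5214 patch): run the slow disk probe with no lock held, and take a short write lock only to publish the results, so readers such as getFailedDirs() are never stuck behind disk IO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for the DirectoryCollection idea.
class DirList {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> goodDirs = new ArrayList<>();
    private List<String> failedDirs = new ArrayList<>();

    // Fast: heartbeat threads hold the read lock only long enough to copy.
    List<String> getFailedDirs() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(failedDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Slow disk IO happens here with no lock held.
    void checkDirs(List<String> allDirs) {
        List<String> good = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String dir : allDirs) {
            // canWrite() is a stand-in for DiskChecker.checkDir()
            if (new java.io.File(dir).canWrite()) {
                good.add(dir);
            } else {
                failed.add(dir);
            }
        }
        lock.writeLock().lock(); // brief: just swap in the results
        try {
            goodDirs = good;
            failedDirs = failed;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

      With this structure, a heartbeat calling getFailedDirs() can block only for the duration of the final list swap, never for the mkdir probe itself.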

      1. YARN-5214.patch
        11 kB
        Junping Du
      2. YARN-5214-v2.patch
        12 kB
        Junping Du
      3. YARN-5214-v3.patch
        12 kB
        Junping Du

        Activity

        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #10052 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10052/)
        YARN-5214. Fixed locking in DirectoryCollection to avoid hanging NMs (vinodkv: rev ce9c006430d13a28bc1ca57c5c70cc1b7cba1692)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DirectoryCollection.java
        Junping Du added a comment -

        Thanks Vinod Kumar Vavilapalli for review and comments!

        Vinod Kumar Vavilapalli added a comment -

        Committed this to trunk, branch-2 and branch-2.8 (performance fix). Thanks Junping Du!

        Vinod Kumar Vavilapalli added a comment -

        No test cases for this locking-related performance fix. Correctness is validated by existing tests.

        +1 for the latest patch, checking this in.

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 21s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 7m 35s trunk passed
        +1 compile 0m 34s trunk passed
        +1 checkstyle 0m 18s trunk passed
        +1 mvnsite 0m 31s trunk passed
        +1 mvneclipse 0m 14s trunk passed
        +1 findbugs 0m 47s trunk passed
        +1 javadoc 0m 16s trunk passed
        +1 mvninstall 0m 23s the patch passed
        +1 compile 0m 23s the patch passed
        +1 javac 0m 23s the patch passed
        +1 checkstyle 0m 14s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 7 unchanged - 2 fixed = 7 total (was 9)
        +1 mvnsite 0m 26s the patch passed
        +1 mvneclipse 0m 11s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 49s the patch passed
        +1 javadoc 0m 14s the patch passed
        +1 unit 13m 11s hadoop-yarn-server-nodemanager in the patch passed.
        +1 asflicense 0m 15s The patch does not generate ASF License warnings.
        27m 21s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812671/YARN-5214-v3.patch
        JIRA Issue YARN-5214
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 80c595af3e61 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 9560f25
        Default Java 1.8.0_91
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12190/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/12190/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Vinod Kumar Vavilapalli added a comment -

        The latest patch looks good to me. +1.

        Manually rekicking Jenkins, as the patch has been around for a while and trunk may have moved on.

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 29s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 6m 51s trunk passed
        +1 compile 0m 27s trunk passed
        +1 checkstyle 0m 16s trunk passed
        +1 mvnsite 0m 29s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 47s trunk passed
        +1 javadoc 0m 16s trunk passed
        +1 mvninstall 0m 23s the patch passed
        +1 compile 0m 23s the patch passed
        +1 javac 0m 23s the patch passed
        +1 checkstyle 0m 13s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 7 unchanged - 2 fixed = 7 total (was 9)
        +1 mvnsite 0m 26s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 47s the patch passed
        +1 javadoc 0m 15s the patch passed
        +1 unit 13m 9s hadoop-yarn-server-nodemanager in the patch passed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        26m 29s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:e2f6409
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812671/YARN-5214-v3.patch
        JIRA Issue YARN-5214
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 2a386f659cbc 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 17eae9e
        Default Java 1.8.0_91
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12113/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/12113/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Junping Du added a comment -

        Sounds like the test failure is related. Fixed in the v3 patch by keeping CopyOnWriteArrayList.

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 17s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 6m 21s trunk passed
        +1 compile 0m 25s trunk passed
        +1 checkstyle 0m 15s trunk passed
        +1 mvnsite 0m 28s trunk passed
        +1 mvneclipse 0m 12s trunk passed
        +1 findbugs 0m 42s trunk passed
        +1 javadoc 0m 16s trunk passed
        +1 mvninstall 0m 22s the patch passed
        +1 compile 0m 23s the patch passed
        +1 javac 0m 23s the patch passed
        -1 checkstyle 0m 13s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 2 new + 7 unchanged - 2 fixed = 9 total (was 9)
        +1 mvnsite 0m 28s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 48s the patch passed
        +1 javadoc 0m 14s the patch passed
        -1 unit 12m 57s hadoop-yarn-server-nodemanager in the patch failed.
        +1 asflicense 0m 14s The patch does not generate ASF License warnings.
        25m 21s



        Reason Tests
        Failed junit tests hadoop.yarn.server.nodemanager.TestDirectoryCollection



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:e2f6409
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12812284/YARN-5214-v2.patch
        JIRA Issue YARN-5214
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 2fb0bbebf445 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / b2c596c
        Default Java 1.8.0_91
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12098/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/12098/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12098/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12098/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/12098/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Junping Du added a comment -

        The v2 patch fixes two issues:
        1. Replace the lock on dirsChangeListeners with a thread-safe set, as suggested in Vinod's comments.
        2. Replace the CopyOnWriteArrayList for localDirs, errorDirs and fullDirs with a plain ArrayList, since the copy-on-write semantics are not needed either before or after this change.
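        One way to get a thread-safe listener set without an explicit lock (a sketch with a hypothetical class name; the actual patch may differ) is a set backed by a ConcurrentHashMap, which also works on Java 7:

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ListenerRegistry {
    // Concurrent set: registration and the notification loop can both
    // run without holding the DirectoryCollection monitor.
    private final Set<Runnable> dirsChangeListeners =
        Collections.newSetFromMap(new ConcurrentHashMap<Runnable, Boolean>());

    void register(Runnable listener) {
        dirsChangeListeners.add(listener);
    }

    void notifyListeners() {
        // Weakly consistent iteration; never throws
        // ConcurrentModificationException even if listeners are added
        // concurrently.
        for (Runnable l : dirsChangeListeners) {
            l.run();
        }
    }
}
```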

        Junping Du added a comment -

        Thanks Wangda Tan and Vinod Kumar Vavilapalli for review and comments!

        dirsChangeListeners doesn't change except on service-start and stop. So no need to grab the global / read / write lock, we can simply make it use a thread-safe collection?

        That's a good point. I will get rid of the lock for dirsChangeListeners in the next patch.

        In createNonExistentDirs(), you don't need to make a manual copy of localDirs, the iterator() method already does it for you.

        I would like to get rid of CopyOnWriteArrayList and replace it with a normal ArrayList, given that we already have a read/write lock on the related dirs. In createNonExistentDirs(), explicitly copying localDirs under a read lock and making createDir() lock-free is clearer, isn't it?

        Can you not use the java8 stuff (diamond operator etc), so that this patch can be backported to the older releases?

        The diamond operator has been supported since Java 7...
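        The copy-under-read-lock pattern described above might look like this (a sketch with hypothetical names, not the actual patch): snapshot the list under the read lock, then do the possibly slow mkdir IO with no lock held.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DirCreator {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> localDirs = new ArrayList<>();

    void addDir(String dir) {
        lock.writeLock().lock();
        try {
            localDirs.add(dir);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void createNonExistentDirs() {
        // Snapshot under the read lock...
        List<String> snapshot;
        lock.readLock().lock();
        try {
            snapshot = new ArrayList<>(localDirs);
        } finally {
            lock.readLock().unlock();
        }
        // ...then do the (possibly slow) mkdir IO lock-free.
        for (String dir : snapshot) {
            new java.io.File(dir).mkdirs();
        }
    }
}
```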

        Vinod Kumar Vavilapalli added a comment -

        Tx for working on this Junping Du.

        The patch looks good overall, very close, but a couple of comments follow. I think we can do better in some areas:

        • dirsChangeListeners doesn't change except on service-start and stop. So no need to grab the global / read / write lock, we can simply make it use a thread-safe collection?
        • In createNonExistentDirs(), you don't need to make a manual copy of localDirs, the iterator() method already does it for you.
        • Can you not use the java8 stuff (diamond operator etc), so that this patch can be backported to the older releases?
        Wangda Tan added a comment -

        Junping Du, makes sense to me.

        I looked at the details of the patch as well. LGTM, +1. I suggest getting a review from another committer before committing it.

        Junping Du added a comment -

        Thanks Wangda Tan for review and comments!

        even after the R/W lock changes, when anything bad happens on the disks, DirectoryCollection will be stuck under write locks, so NodeStatusUpdater will be blocked as well.

        Not really. From the jstack above, you can see that the operation pending on busy IO, shown below, now runs outside any lock.

        Map<String, DiskErrorInformation> dirsFailedCheck = testDirs(allLocalDirs,
                 preCheckGoodDirs);
        

        So NodeStatusUpdater won't be blocked while testDirs is pending on a mkdir operation.

        1) In the short term, errorDirs/fullDirs/localDirs are copy-on-write lists, so we don't need to acquire the lock in getGoodDirs/getFailedDirs. This could lead to inconsistent data in rare cases, but I think in general this is safe, and the inconsistent data will be corrected in the next heartbeat.

        In general, a read/write lock is more flexible and more consistent, since we have several resources subject to race conditions. A copy-on-write list can only guarantee that no modification exception occurs between a read and a write on the same list; it cannot provide consistent semantics across lists. Thus, I would prefer to use a read/write lock here, and CopyOnWriteArrayList can be replaced with a plain ArrayList.

        2) In the longer term, we may need to treat a DirectoryCollection stuck under busy IO as an unhealthy state; NodeStatusUpdater should be able to report such a status to the RM, so the RM will avoid allocating any new containers to such nodes.

        I agree we should provide better IO control on each node of a YARN cluster. We can report an unhealthy status when IO gets stuck, or, even better, count IO load as a resource for smarter scheduling. However, how best to react to very busy IO is a different topic from the problem this JIRA tries to resolve. In any case, the NM heartbeat is not supposed to be cut off unless the daemon crashes.
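        The cross-list consistency argument above can be illustrated with a sketch (hypothetical names): one read/write lock covering several lists lets a reader take a single atomic snapshot of all of them, which separate CopyOnWriteArrayLists cannot guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DirState {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> goodDirs = new ArrayList<>();
    private final List<String> errorDirs = new ArrayList<>();

    void addGood(String dir) {
        lock.writeLock().lock();
        try {
            goodDirs.add(dir);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Writer moves a dir between the two lists atomically.
    void markFailed(String dir) {
        lock.writeLock().lock();
        try {
            goodDirs.remove(dir);
            errorDirs.add(dir);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reader sees both lists at one consistent instant: a dir is never
    // observed in both lists (or in neither). Two independent
    // CopyOnWriteArrayLists give no such cross-list guarantee.
    List<List<String>> snapshot() {
        lock.readLock().lock();
        try {
            List<List<String>> s = new ArrayList<>();
            s.add(new ArrayList<>(goodDirs));
            s.add(new ArrayList<>(errorDirs));
            return s;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```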

        Wangda Tan added a comment -

        Thanks Junping Du,

        I think the RW lock added by the patch can generally reduce the time spent on locking. However, it may not solve the entire problem.

        Per my understanding, even after the R/W lock changes, when anything bad happens on the disks, DirectoryCollection will be stuck under write locks, so NodeStatusUpdater will be blocked as well.

        I think there are two fixes we can make to tackle the problem:
        1) In the short term, errorDirs/fullDirs/localDirs are copy-on-write lists, so we don't need to acquire the lock in getGoodDirs/getFailedDirs. This could lead to inconsistent data in rare cases, but I think in general this is safe, and the inconsistent data will be corrected in the next heartbeat.

        2) In the longer term, we may need to treat a DirectoryCollection stuck under busy IO as an unhealthy state; NodeStatusUpdater should be able to report such a status to the RM, so the RM will avoid allocating any new containers to such nodes. Nathan Roberts suggested the same thing.

        Thoughts?

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 23s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 6m 53s trunk passed
        +1 compile 0m 26s trunk passed
        +1 checkstyle 0m 16s trunk passed
        +1 mvnsite 0m 27s trunk passed
        +1 mvneclipse 0m 12s trunk passed
        +1 findbugs 0m 43s trunk passed
        +1 javadoc 0m 17s trunk passed
        +1 mvninstall 0m 23s the patch passed
        +1 compile 0m 23s the patch passed
        +1 javac 0m 23s the patch passed
        -1 checkstyle 0m 13s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 2 new + 7 unchanged - 2 fixed = 9 total (was 9)
        +1 mvnsite 0m 25s the patch passed
        +1 mvneclipse 0m 9s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 46s the patch passed
        +1 javadoc 0m 14s the patch passed
        +1 unit 12m 57s hadoop-yarn-server-nodemanager in the patch passed.
        +1 asflicense 0m 15s The patch does not generate ASF License warnings.
        26m 3s



        Subsystem Report/Notes
        Docker Image yetus/hadoop:2c91fd8
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12810462/YARN-5214.patch
        JIRA Issue YARN-5214
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 7110c16259f9 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 8e8cb4c
        Default Java 1.8.0_91
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12013/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12013/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/12013/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Hide
        djp Junping Du added a comment -

        Thanks Nathan Roberts.
        Attached a patch that replaces the fully synchronized methods with a fine-grained read/write lock on the resources (localDirs, errorDirs, fullDirs, numFailures, etc.) shared between threads. The key idea is to release the lock during testDirs(), a heavy IO operation that sometimes blocks under heavy IO load. Please note this is not the finest-grained locking possible: we could aggressively make access to different shared resources wait on different locks, but that would make some of the logic tricky and could introduce a risk of deadlock.
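        The key idea can be sketched as follows, with a stand-in testDirs() and hypothetical class name (CheckDirsSketch) rather than the real patch code: the heavy IO runs with no lock held, and only the quick bookkeeping takes the write lock.

        ```java
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // Sketch of the patch's locking pattern, not the actual patch.
        class CheckDirsSketch {
            private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
            private final List<String> localDirs = new ArrayList<>();

            // Readers take the (cheap, shared) read lock only briefly.
            List<String> getGoodDirs() {
                rwLock.readLock().lock();
                try {
                    return new ArrayList<>(localDirs);
                } finally {
                    rwLock.readLock().unlock();
                }
            }

            void checkDirs(List<String> allDirs) {
                // Heavy IO happens with NO lock held, so a stuck disk probe
                // cannot block the heartbeat path.
                Map<String, Boolean> results = testDirs(allDirs);
                // Only the quick bookkeeping takes the write lock.
                rwLock.writeLock().lock();
                try {
                    localDirs.clear();
                    for (Map.Entry<String, Boolean> e : results.entrySet()) {
                        if (e.getValue()) {
                            localDirs.add(e.getKey());
                        }
                    }
                } finally {
                    rwLock.writeLock().unlock();
                }
            }

            // Stand-in for the real disk probe (mkdir/write/read checks).
            private Map<String, Boolean> testDirs(List<String> dirs) {
                Map<String, Boolean> out = new HashMap<>();
                for (String d : dirs) {
                    out.put(d, !d.contains("bad"));
                }
                return out;
            }
        }
        ```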

        Hide
        nroberts Nathan Roberts added a comment -

        Junping Du, I agree it makes sense to keep the heartbeat path as lock free as possible.

        Hide
        djp Junping Du added a comment -

        Thanks Nathan Roberts for sharing the solution on this!
        I agree that to fix the root cause of this particular issue, we may need to configure the deadline IO scheduler in Linux. Otherwise, long IO waits will definitely cause other serious issues; for example, we also noticed that ResourceLocalizationService gets blocked as well.
        On the other side, we need to consider whether hanging the NM heartbeat or localizer under busy IO with the wrong IO scheduler setting is something we really want: at the least, we should replace the synchronized method lock with something that tries to lock and prints a useful debug log if it pends too long. Maybe we can do more, following the same principle as HDFS-9239, and release unnecessary locks on the NM-RM heartbeat path as much as possible? Thoughts?
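        The "try to lock and print a debug log if pending too long" idea could look roughly like this sketch (hypothetical class and method names, not NM code): the caller waits a bounded time for the lock instead of blocking indefinitely on the monitor.

        ```java
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.locks.ReentrantLock;

        // Sketch: bounded lock acquisition with a stall diagnostic.
        class TimedLockSketch {
            private final ReentrantLock lock = new ReentrantLock();

            // Returns true if the guarded section ran; false if the lock was
            // held too long (e.g. by a disk check stuck in IO wait).
            boolean runWithTimeout(Runnable section, long timeoutMs)
                    throws InterruptedException {
                if (!lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    System.err.println("WARN: lock pending > " + timeoutMs
                        + "ms; holder may be blocked on disk IO");
                    return false;
                }
                try {
                    section.run();
                    return true;
                } finally {
                    lock.unlock();
                }
            }
        }
        ```

        On a timeout the heartbeat could proceed with slightly stale data (or report the node as degraded) instead of being marked LOST by the RM.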

        Hide
        nroberts Nathan Roberts added a comment -

        I'm not suggesting this change shouldn't be made but keep in mind that if the NM is having trouble performing this type of action within the timeout (10 minutes or so), then the node is not very healthy and probably shouldn't be given anything more to run until the situation improves. It's going to have trouble doing all sorts of other things as well so having it look unhealthy in some fashion isn't all bad. If we somehow keep heartbeats completely free of I/O, then the RM will keep assigning containers that will likely run into exactly the same slowness.

        We used to see similar issues that we resolved by switching to the deadline I/O scheduler (assuming linux). See https://issues.apache.org/jira/browse/HDFS-9239?focusedCommentId=15218302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15218302


          People

          • Assignee:
            djp Junping Du
            Reporter:
            djp Junping Du
          • Votes:
            0
            Watchers:
            11
