NM fails to read files from full disks which can lead to container logs being lost and other issues

    Details

    • Hadoop Flags: Reviewed

      Description

      Container logs can be lost if a disk has become full (~90% full).
      When an application finishes, we upload logs after aggregation by calling AppLogAggregatorImpl#uploadLogsForContainers. But this call in turn checks the eligible directories by calling LocalDirsHandlerService#getLogDirs, which returns nothing when the disk is full. So none of the container logs are aggregated and uploaded.
      But on application finish, we also call AppLogAggregatorImpl#doAppLogAggregationPostCleanUp(). This deletes the application directory, which contains the container logs, because it calls LocalDirsHandlerService#getLogDirsForCleanup, which returns the full disks as well.
      So we are left with neither the aggregated logs nor the individual container logs for the app.
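
      The mismatch between the two directory lists can be illustrated with a minimal, self-contained sketch. It does not use the real NodeManager classes: LogDirsSketch, its two fields, and the example path are hypothetical stand-ins that only model the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behavior described in this issue; not the real NodeManager code.
class LogDirsSketch {
    final List<String> goodLogDirs = new ArrayList<>();
    final List<String> fullLogDirs = new ArrayList<>();

    // Stand-in for getLogDirs(): only disks that are not full.
    List<String> getLogDirs() {
        return new ArrayList<>(goodLogDirs);
    }

    // Stand-in for getLogDirsForCleanup(): good disks plus full disks.
    List<String> getLogDirsForCleanup() {
        List<String> dirs = new ArrayList<>(goodLogDirs);
        dirs.addAll(fullLogDirs);
        return dirs;
    }

    public static void main(String[] args) {
        LogDirsSketch handler = new LogDirsSketch();
        // Assumption for the example: the only log disk has crossed the fullness threshold.
        handler.fullLogDirs.add("/disk1/logs");

        List<String> uploadDirs = handler.getLogDirs();            // dirs scanned for upload
        List<String> cleanupDirs = handler.getLogDirsForCleanup(); // dirs deleted afterwards

        System.out.println("upload scans:  " + uploadDirs);
        System.out.println("cleanup scans: " + cleanupDirs);
    }
}
```

      With a single full log disk, the upload step sees an empty list while the cleanup step still sees /disk1/logs, which is the loss scenario described above.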

      In addition to this, there are two more issues:

      1. ContainerLogsUtils#getContainerLogDirs does not consider full disks, so the NM will fail to serve logs from full disks through its web interfaces.
      2. RecoveredContainerLaunch#locatePidFile also does not consider full disks, so the PID file may not be found on container recovery.
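
      Both issues come down to read paths that search only the good disks. A rough sketch of a lookup that also searches full disks follows. ReadDirsSketch, dirsForRead, locate, and the file name are illustrative names chosen for this example and are not taken from the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; names are illustrative and not taken from the patch.
class ReadDirsSketch {
    // Directories usable for reads: good disks first, then full disks.
    static List<Path> dirsForRead(List<Path> goodDirs, List<Path> fullDirs) {
        List<Path> dirs = new ArrayList<>(goodDirs);
        dirs.addAll(fullDirs);
        return dirs;
    }

    // Returns the first existing copy of relativePath under any candidate dir.
    static Optional<Path> locate(List<Path> dirs, String relativePath) {
        for (Path dir : dirs) {
            Path candidate = dir.resolve(relativePath);
            if (Files.exists(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Path good = Files.createTempDirectory("good");
        Path full = Files.createTempDirectory("full");
        // Assumption for the example: the PID file lives on the disk treated as full.
        Files.createFile(full.resolve("container_1.pid"));

        List<Path> goodOnly = Collections.singletonList(good);
        List<Path> forRead = dirsForRead(goodOnly, Collections.singletonList(full));

        System.out.println("good disks only: " + locate(goodOnly, "container_1.pid").isPresent());
        System.out.println("good + full:     " + locate(forRead, "container_1.pid").isPresent());
    }
}
```

      Searching full disks is safe for reads because fullness only prevents new writes; the files already on the disk remain readable.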

      Attachments

      1. YARN-3850.02.patch (13 kB, Varun Saxena)
      2. YARN-3850.01.patch (7 kB, Varun Saxena)

        Activity

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Pulled this into 2.6.1 after fixing a minor conflict in TestLogAggregation.java.

        Ran compilation and TestLogAggregationService, TestContainerLogsPage before the push. Patch applied cleanly.

        sjlee0 Sangjin Lee added a comment -

        The merge to 2.6.0 is straightforward.

        zxu zhihai xu added a comment -

        Hi Jason Lowe and Varun Saxena, I find that ContainerLogsUtils#getContainerLogFile may also have this problem, because getContainerLogFile calls dirsHandler.getLogPathToRead, which uses the configuration value YarnConfiguration.NM_LOG_DIRS in LocalDirAllocator. YarnConfiguration.NM_LOG_DIRS will only include good disks after LocalDirsHandlerService#updateDirsAfterTest is called. I created YARN-3925 to fix this issue.
        It looks like ShuffleHandler doesn't have this problem, because ShuffleHandler and LocalDirsHandlerService use different Configuration objects.
        LocalDirsHandlerService.serviceInit creates a copy of the original Configuration:

            // Clone the configuration as we may do modifications to dirs-list
            Configuration conf = new Configuration(config);
        

        So ShuffleHandler will always use the original local dirs-list.
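
        The effect of cloning can be sketched without Hadoop on the classpath. ConfSketch below is a hypothetical, map-backed stand-in for a configuration class with a copy constructor, and the directory values are made up for the example; only the property key yarn.nodemanager.log-dirs corresponds to YarnConfiguration.NM_LOG_DIRS.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for a configuration object with a copy constructor.
class ConfSketch {
    private final Map<String, String> props = new HashMap<>();

    ConfSketch() {}

    // Copy constructor: later changes to the copy do not touch the original.
    ConfSketch(ConfSketch other) {
        props.putAll(other.props);
    }

    void set(String key, String value) { props.put(key, value); }

    String get(String key) { return props.get(key); }

    public static void main(String[] args) {
        ConfSketch original = new ConfSketch();
        original.set("yarn.nodemanager.log-dirs", "/disk1/logs,/disk2/logs");

        // The dirs handler works on its own copy and trims it to the good disks.
        ConfSketch copy = new ConfSketch(original);
        copy.set("yarn.nodemanager.log-dirs", "/disk2/logs");

        System.out.println("original: " + original.get("yarn.nodemanager.log-dirs"));
        System.out.println("copy:     " + copy.get("yarn.nodemanager.log-dirs"));
    }
}
```

        A component that holds the original object keeps seeing the full dirs-list, while anything that reads through the copy sees only the trimmed list.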

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2187 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2187/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        • hadoop-yarn-project/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #239 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/239/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        • hadoop-yarn-project/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #2169 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2169/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #230 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/230/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #971 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/971/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #241 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/241/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        varun_saxena Varun Saxena added a comment -

        Thanks for the review and commit, Jason Lowe.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #8072 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8072/)
        YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
        jlowe Jason Lowe added a comment -

        Thanks, Varun! I committed this to trunk, branch-2, and branch-2.7.

        jlowe Jason Lowe added a comment -

        +1 lgtm. Committing this.

        varun_saxena Varun Saxena added a comment -

        There seems to be an issue with the whitespace check. The line it reports in the result doesn't have any trailing whitespace. The line below it does, but that wasn't added by me.

        hadoopqa Hadoop QA added a comment -

        -1 overall

        Vote  Subsystem        Runtime  Comment
        0     pre-patch        15m 41s  Pre-patch trunk compilation is healthy.
        +1    @author          0m 0s    The patch does not contain any @author tags.
        +1    tests included   0m 0s    The patch appears to include 2 new or modified test files.
        +1    javac            7m 24s   There were no new javac warning messages.
        +1    javadoc          9m 31s   There were no new javadoc warning messages.
        +1    release audit    0m 20s   The applied patch does not increase the total number of release audit warnings.
        +1    checkstyle       0m 38s   There were no new checkstyle issues.
        -1    whitespace       0m 0s    The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
        +1    install          1m 32s   mvn install still works.
        +1    eclipse:eclipse  0m 31s   The patch built with eclipse:eclipse.
        +1    findbugs         1m 10s   The patch does not introduce any new Findbugs (version 3.0.0) warnings.
        +1    yarn tests       5m 59s   Tests passed in hadoop-yarn-server-nodemanager.
                               42m 50s

        Subsystem                                Report/Notes
        Patch URL                                http://issues.apache.org/jira/secure/attachment/12742014/YARN-3850.02.patch
        Optional Tests                           javadoc javac unit findbugs checkstyle
        git revision                             trunk / 1403b84
        whitespace                               https://builds.apache.org/job/PreCommit-YARN-Build/8353/artifact/patchprocess/whitespace.txt
        hadoop-yarn-server-nodemanager test log  https://builds.apache.org/job/PreCommit-YARN-Build/8353/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
        Test Results                             https://builds.apache.org/job/PreCommit-YARN-Build/8353/testReport/
        Java                                     1.7.0_55
        uname                                    Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Console output                           https://builds.apache.org/job/PreCommit-YARN-Build/8353/console

        This message was automatically generated.

        varun_saxena Varun Saxena added a comment -

        > I think we need to also update ContainerLogsUtil.getContainerLogDirs so the NM will also serve up logs from full disks from its web interfaces.

        Yeah, that's correct.
        I will work on both of these as part of this JIRA, because otherwise I would have to raise two more JIRAs.

        jlowe Jason Lowe added a comment -

        Thanks for the patch, Varun. Looks good overall, but I think we need to also update ContainerLogsUtil.getContainerLogDirs so the NM will also serve up logs from full disks from its web interfaces.

        > Raise a separate JIRA for this or fix it as part of this one?

        I'm OK with fixing it as part of this (could change ticket summary to "NM fails to read files from full disks" or something) or fixing it as a separate one, but that ticket would also be pretty critical to fix.

        varun_saxena Varun Saxena added a comment -

        Raise a separate JIRA for this or fix it as part of this one?

        hadoopqa Hadoop QA added a comment -



        +1 overall

        | Vote | Subsystem | Runtime | Comment |
        | 0 | pre-patch | 16m 27s | Pre-patch trunk compilation is healthy. |
        | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
        | +1 | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
        | +1 | javac | 7m 58s | There were no new javac warning messages. |
        | +1 | javadoc | 9m 55s | There were no new javadoc warning messages. |
        | +1 | release audit | 0m 21s | The applied patch does not increase the total number of release audit warnings. |
        | +1 | checkstyle | 0m 42s | There were no new checkstyle issues. |
        | +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
        | +1 | install | 1m 36s | mvn install still works. |
        | +1 | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
        | +1 | findbugs | 1m 15s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
        | +1 | yarn tests | 6m 17s | Tests passed in hadoop-yarn-server-nodemanager. |
        | | | 45m 6s | |

        | Subsystem | Report/Notes |
        | Patch URL | http://issues.apache.org/jira/secure/attachment/12741960/YARN-3850.01.patch |
        | Optional Tests | javadoc javac unit findbugs checkstyle |
        | git revision | trunk / aa5b15b |
        | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8351/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
        | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8351/testReport/ |
        | Java | 1.7.0_55 |
        | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
        | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8351/console |

        This message was automatically generated.

        varun_saxena Varun Saxena added a comment -

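        The following also seems to be a problem: it does not consider full disks either, so the PID file may not be found on container recovery.
        RecoveredContainerLaunch#locatePidFile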
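
As a rough sketch of the lookup behaviour this implies, the snippet below searches a list of candidate directories for a PID file. `PidFileLocator` and `locate` are hypothetical names and this is not the code in `RecoveredContainerLaunch`. It assumes the caller passes both good and full directories as the read list.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: finds a PID file under any directory that may still hold it. */
final class PidFileLocator {
    private PidFileLocator() {}

    /**
     * Returns the first existing regular file named by relativePidPath under the
     * given directories, or empty if none of them contains it.
     */
    static Optional<Path> locate(List<Path> dirsForRead, String relativePidPath) {
        for (Path dir : dirsForRead) {
            Path candidate = dir.resolve(relativePidPath);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

If the read list contains only good directories, a PID file written before its disk filled up would be missed, which matches the failure described here.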

        varun_saxena Varun Saxena added a comment -

        ShuffleHandler only reads data from disk, and LocalDirAllocator does not check disk space at the time of reading. So while we may search disks which have gone bad (maybe not a big concern), we will not fail to search disks that were bad/full on startup and later became good.
        getLocalPathToRead in LocalDirAllocator seems to check only permissions and existence (on the call to confChanged).
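
The difference between a read-side check and a write-side check can be sketched as follows. `DirChecks`, `usableForRead`, and `usableForWrite` are hypothetical names and the `minFreeBytes` threshold is an assumption for illustration. This is not how LocalDirAllocator is implemented, only an example of the distinction being described.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical checks contrasting what a reader needs with what a writer needs. */
final class DirChecks {
    private DirChecks() {}

    /** Read-side check: the directory only has to exist and be readable. */
    static boolean usableForRead(Path dir) {
        return Files.isDirectory(dir) && Files.isReadable(dir);
    }

    /** Write-side check: additionally require write permission and enough free space. */
    static boolean usableForWrite(Path dir, long minFreeBytes) {
        return usableForRead(dir)
            && Files.isWritable(dir)
            && dir.toFile().getUsableSpace() >= minFreeBytes;
    }
}
```

A directory on a full disk fails the write-side check but still passes the read-side check, so a reader that applies only the read-side check keeps finding files there.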

        varun_saxena Varun Saxena added a comment -

        Yes, this also looks like a problem. We should not use LocalDirAllocator for ShuffleHandler.
        I will look for other areas where a similar problem can happen and update here if I find something.

        jlowe Jason Lowe added a comment -

        After thinking about this I was wondering if ShuffleHandler had a similar issue, since it too is looking for places to read files. It looks like it might not be affected in the same way, since it doesn't use LocalDirsHandlerService and just uses the underlying LocalDirAllocator. I don't think the latter will auto-update the list of bad/good directories, since it doesn't appear to update unless something tries to write through it or the conf is updated.

        I think it could be problematic in that the ShuffleHandler will likely continue to search disks that later go bad, or fail to search disks that were bad/full on startup and later became good. If we start persisting bad/full disks across NM restart, then it seems likely a map task could deposit shuffle data on a disk that the ShuffleHandler, with its stale view of the disks from startup, will fail to search. What do you think? This should be addressed as a separate JIRA if it is a problem, but I'm trying to think of other places in the NM where we would have a similar bug: only searching good dirs for reading rather than also checking the full disks.


          People

          • Assignee:
            varun_saxena Varun Saxena
            Reporter:
            varun_saxena Varun Saxena
          • Votes:
            0
            Watchers:
            13
