YARN-7087: NM failed to perform log aggregation due to absent container

      Description

      Saw a case where the NM failed to aggregate the logs for a container because it claimed it was absent:

      2017-08-23 18:35:38,283 [AsyncDispatcher event handler] WARN logaggregation.LogAggregationService: Log aggregation cannot be started for container_e07_1503326514161_502342_01_000001, as its an absent container
      

      Containers should not be allowed to disappear if they're not done being fully processed by the NM.

      1. YARN-7087.001.patch
        21 kB
        Jason Lowe
      2. YARN-7087.002.patch
        22 kB
        Jason Lowe

          Activity

          djp Junping Du added a comment -

          Sounds like we forgot to merge the patch to branch-2.8.2. Just merged it.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #12245 (See https://builds.apache.org/job/Hadoop-trunk-Commit/12245/)
          YARN-7087. NM failed to perform log aggregation due to absent container. (epayne: rev e864f81471407a384395fefe1ceb3b66fc7f87f2)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/LogAggregationService.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/event/LogHandlerContainerFinishedEvent.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/loghandler/TestNonAggregatingLogHandler.java
          eepayne Eric Payne added a comment -

          Committed to trunk (3.0.0-beta1), branch-2 (2.9.0), branch-2.8 (2.8.2)

          eepayne Eric Payne added a comment -

          Jason Lowe, thanks for finding, reporting, and fixing this issue.

          +1. The patch LGTM.

          I will commit this afternoon.

          jlowe Jason Lowe added a comment -

          Increasing severity to Blocker: this is not that rare when apps are killed, and logs are lost when it happens.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 28s   Docker mode activated.
                Prechecks
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          +1    mvninstall  14m 49s  trunk passed
          +1    compile     0m 47s   trunk passed
          +1    checkstyle  0m 28s   trunk passed
          +1    mvnsite     0m 32s   trunk passed
          -1    findbugs    0m 56s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager in trunk has 1 extant Findbugs warnings.
          +1    javadoc     0m 20s   trunk passed
                Patch Compile Tests
          +1    mvninstall  0m 28s   the patch passed
          +1    compile     0m 45s   the patch passed
          +1    javac       0m 45s   the patch passed
          +1    checkstyle  0m 21s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 292 unchanged - 3 fixed = 292 total (was 295)
          +1    mvnsite     0m 28s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    0m 56s   the patch passed
          +1    javadoc     0m 18s   the patch passed
                Other Tests
          +1    unit        14m 45s  hadoop-yarn-server-nodemanager in the patch passed.
          +1    asflicense  0m 17s   The patch does not generate ASF License warnings.
                            38m 5s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-7087
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12883573/YARN-7087.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 75c49998b053 3.13.0-123-generic #172-Ubuntu SMP Mon Jun 26 18:04:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8196a07
          Default Java 1.8.0_144
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/17117/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-warnings.html
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17117/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/17117/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Updating the patch to fix the checkstyle issues.

          The TestContainerManager timeout does not appear to be related. It passes for me locally with the patch applied.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 19s   Docker mode activated.
                Prechecks
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 2 new or modified test files.
                trunk Compile Tests
          +1    mvninstall  14m 43s  trunk passed
          +1    compile     0m 44s   trunk passed
          +1    checkstyle  0m 21s   trunk passed
          +1    mvnsite     0m 29s   trunk passed
          -1    findbugs    0m 46s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager in trunk has 1 extant Findbugs warnings.
          +1    javadoc     0m 18s   trunk passed
                Patch Compile Tests
          +1    mvninstall  0m 27s   the patch passed
          +1    compile     0m 42s   the patch passed
          +1    javac       0m 42s   the patch passed
          -0    checkstyle  0m 19s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 4 new + 291 unchanged - 3 fixed = 295 total (was 294)
          +1    mvnsite     0m 26s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    0m 54s   the patch passed
          +1    javadoc     0m 16s   the patch passed
                Other Tests
          -1    unit        12m 31s  hadoop-yarn-server-nodemanager in the patch failed.
          +1    asflicense  0m 15s   The patch does not generate ASF License warnings.
                            34m 49s



          Reason Tests
          Timed out junit tests org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-7087
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12883550/YARN-7087.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 1fd835e784dc 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 8196a07
          Default Java 1.8.0_144
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/17113/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-warnings.html
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/17113/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/17113/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/17113/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/17113/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Attaching a patch that adds the container type to the log aggregation container-finished event. This eliminates the need for AppLogAggregatorImpl to look up the container in the context, where it might no longer be found.

          This appears to be occurring quite often on our clusters in cases where an application is killed, so it would be great to fix this for 2.8.2.
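
A minimal sketch of the idea behind the patch, not the patch itself: the finished event carries the container type captured when the event is created, so the handler never consults the NM context. All class, field, and method names here (`ContainerFinishedEvent`, `Aggregator`, `handle`) are hypothetical stand-ins and do not reproduce the real `LogHandlerContainerFinishedEvent` or `AppLogAggregatorImpl` signatures.

```java
import java.util.ArrayList;
import java.util.List;

public class ContainerFinishedEventSketch {
    // Hypothetical stand-in for YARN's container type.
    enum ContainerType { APPLICATION_MASTER, TASK }

    // Hypothetical event: the container type is captured when the event is
    // created, while the container is still known to the NodeManager.
    static final class ContainerFinishedEvent {
        final String containerId;
        final ContainerType containerType;
        final int exitCode;

        ContainerFinishedEvent(String containerId, ContainerType containerType, int exitCode) {
            this.containerId = containerId;
            this.containerType = containerType;
            this.exitCode = exitCode;
        }
    }

    // Hypothetical aggregator: everything it needs is in the event, so a
    // container that was already removed from the context cannot be "absent".
    static final class Aggregator {
        final List<String> aggregated = new ArrayList<>();

        void handle(ContainerFinishedEvent event) {
            aggregated.add(event.containerId + ":" + event.containerType);
        }
    }

    public static void main(String[] args) {
        Aggregator aggregator = new Aggregator();
        aggregator.handle(new ContainerFinishedEvent(
            "container_01", ContainerType.APPLICATION_MASTER, 0));
        System.out.println(aggregator.aggregated);
    }
}
```

The trade-off in this sketch is a slightly larger event in exchange for removing the handler's dependency on shared state that another thread may have already cleaned up.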

          jlowe Jason Lowe added a comment -

          Looks like this is related to YARN-221 and YARN-4152. The latter fixed the NPE issue introduced by the first, but unfortunately simply ignoring container IDs that are absent isn't a real fix. The end result when that scenario occurs is that we will always skip aggregating that container's logs, and that may or may not be the desired effect. In this case it was not.

          I believe the scenario occurs when LogAggregationService has not yet seen the event requesting log aggregation at the point where the NM heartbeats to the RM and then decides to remove the container because the app has completed. The aggregation service appears to need the container lookup only to get the container type, so maybe we can simply store the container type in the log aggregation event so that it doesn't need to look up the container to process the event.
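
The suspected ordering can be sketched as follows. This is an illustration under assumptions, not NodeManager code: the `context` map, `handleFinishedByLookup`, and the single-threaded replay of the three steps are all hypothetical, chosen only to show why a lookup-based handler loses the logs when the container is removed first.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AbsentContainerRaceSketch {
    // Hypothetical stand-in for YARN's container type.
    enum ContainerType { APPLICATION_MASTER, TASK }

    // Hypothetical stand-in for the NM context's map of live containers.
    static final Map<String, ContainerType> context = new ConcurrentHashMap<>();

    // Lookup-based handling: the handler needs the container type, so it
    // consults the context. If the container is already gone, it gives up
    // and that container's logs are never aggregated.
    static boolean handleFinishedByLookup(String containerId) {
        ContainerType type = context.get(containerId);
        if (type == null) {
            System.out.println("Skipping " + containerId + ": absent container");
            return false;
        }
        System.out.println("Aggregating logs for " + containerId + " (" + type + ")");
        return true;
    }

    public static void main(String[] args) {
        context.put("container_01", ContainerType.APPLICATION_MASTER);

        // Step 1: the finished event is queued but not yet dispatched.
        String queuedEvent = "container_01";

        // Step 2: a heartbeat reports the app as completed and the
        // container is removed from the context.
        context.remove("container_01");

        // Step 3: the event is finally dispatched; the lookup now fails.
        boolean aggregated = handleFinishedByLookup(queuedEvent);
        System.out.println("aggregated = " + aggregated);
    }
}
```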


            People

            • Assignee: jlowe Jason Lowe
            • Reporter: jlowe Jason Lowe
            • Votes: 0
            • Watchers: 6
