Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5043

Fetch failure processing can cause AM event queue to backup and eventually OOM

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.7, 2.1.0-beta
    • Fix Version/s: 0.23.7, 2.1.0-beta
    • Component/s: mr-am
    • Labels:
      None

      Description

      Saw an MRAppMaster with a 3G heap OOM. Upon investigating another instance of it running, we saw the UI in a weird state where the task table and task attempt tables in the job overview page weren't consistent. The AM log showed the AsyncDispatcher had hundreds of thousands of events in the event queue, and jstacks showed it spending a lot of time in fetch failure processing. It turns out fetch failure processing is currently very expensive, with a triple for loop where the inner loop is calling the quite-expensive TaskAttempt.getReport. That function ends up type-converting the entire task report, counters and all, and performing locale conversions among other things. It does this for every reduce task in the job, for every map task that failed. And when it's done building up the large task report, it pulls out one field, the phase, then throws the report away.

      While the AM is busy processing fetch failures, tasks attempts are continuing to send events to the AM including memory-expensive events like status updates which include the counters. These back up in the AsyncDispatcher event queue and eventually even an AM with a large heap size will run out of memory and crash or expire because it thrashes in garbage collect.

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #548 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/548/)
          svn merge -c 1452372 FIXES: MAPREDUCE-5043 Fetch failure processing can cause AM event queue to backup and eventually OOM Contributed by Jason Lowe (Revision 1454440)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1454440
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #548 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/548/ ) svn merge -c 1452372 FIXES: MAPREDUCE-5043 Fetch failure processing can cause AM event queue to backup and eventually OOM Contributed by Jason Lowe (Revision 1454440) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1454440 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1335 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1335/)
          MAPREDUCE-5043. Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1335 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1335/ ) MAPREDUCE-5043 . Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #146 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/146/)
          MAPREDUCE-5043. Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Show
          Hudson added a comment - Integrated in Hadoop-Yarn-trunk #146 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/146/ ) MAPREDUCE-5043 . Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Hide
          Robert Joseph Evans added a comment -

          Thanks Jason,

          I put this into trunk, branch-2, and branch-0.23

          Show
          Robert Joseph Evans added a comment - Thanks Jason, I put this into trunk, branch-2, and branch-0.23
          Hide
          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3408 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3408/)
          MAPREDUCE-5043. Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372)

          Result = SUCCESS
          bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Show
          Hudson added a comment - Integrated in Hadoop-trunk-Commit #3408 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3408/ ) MAPREDUCE-5043 . Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372) Result = SUCCESS bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java
          Hide
          Robert Joseph Evans added a comment -

          The change looks fine to me and Jenkins returned with a +1 so I am +1 too. I'll check it in.

          Show
          Robert Joseph Evans added a comment - The change looks fine to me and Jenkins returned with a +1 so I am +1 too. I'll check it in.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12571761/MAPREDUCE-5043.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 tests included appear to have a timeout.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3377//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3377//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12571761/MAPREDUCE-5043.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 2 new or modified test files. +1 tests included appear to have a timeout. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3377//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3377//console This message is automatically generated.
          Hide
          Jason Lowe added a comment -

          Patch to implement the proposed approach. No unit tests since this is an optimization.

          Did some rudimentary performance testing on a sleep job with 1000 maps and 1000 reduces and forcing the reducer's Fetcher to always fail. With this job profile fetch failure processing appears to be well over 10x faster. Before the patch processing a fetch failure for this job profile took around 10msec per map attempt in the event. After the patch the processing a single map attempt in a fetch failure event was well less than a millisecond, and processing 30+ map attempts was taking around 1 millisecond total.

          Show
          Jason Lowe added a comment - Patch to implement the proposed approach. No unit tests since this is an optimization. Did some rudimentary performance testing on a sleep job with 1000 maps and 1000 reduces and forcing the reducer's Fetcher to always fail. With this job profile fetch failure processing appears to be well over 10x faster. Before the patch processing a fetch failure for this job profile took around 10msec per map attempt in the event. After the patch the processing a single map attempt in a fetch failure event was well less than a millisecond, and processing 30+ map attempts was taking around 1 millisecond total.
          Hide
          Jason Lowe added a comment -

          One approach to fixing this is to have TaskAttempt provide a cheap interface for getting just the phase. The fetch failure processing can then compute the total number of reducers in the shuffle phase before iterating through the maps with fetch failures rather than computing it redundantly for each map attempt.

          Show
          Jason Lowe added a comment - One approach to fixing this is to have TaskAttempt provide a cheap interface for getting just the phase. The fetch failure processing can then compute the total number of reducers in the shuffle phase before iterating through the maps with fetch failures rather than computing it redundantly for each map attempt.

            People

            • Assignee:
              Jason Lowe
              Reporter:
              Jason Lowe
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development