Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4992

AM hangs in RecoveryService when recovering tasks with speculative attempts

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: trunk, 2.0.2-alpha, 0.23.6
    • Fix Version/s: 0.23.7, 2.1.0-beta
    • Component/s: mr-am
    • Labels:
      None

      Description

      A job hung in the Recovery Service on an AM restart. There were four map tasks events that were not processed and that prevented the complete task count from reaching zero which exits the recovery service. All four tasks were speculative

      1. MAPREDUCE-4992v1.patch
        15 kB
        Robert Parker
      2. MAPREDUCE-4992v2.patch
        15 kB
        Robert Parker

        Issue Links

          Activity

          Hide
          Robert Parker added a comment -

          Added code to identify all the completed task attempts in the JobHistory parser and then removed incomplete task attempts from the completed tasks identified in the recovery service.

          Show
          Robert Parker added a comment - Added code to identify all the completed task attempts in the JobHistory parser and then removed incomplete task attempts from the completed tasks identified in the recovery service.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12568892/MAPREDUCE-4992v1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 eclipse:eclipse. The patch failed to build with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3326//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3326//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568892/MAPREDUCE-4992v1.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. -1 eclipse:eclipse . The patch failed to build with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3326//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3326//console This message is automatically generated.
          Hide
          Jason Lowe added a comment -

          This approach looks OK for the short-term. I'm not thrilled about the idea of explicitly losing information on task attempts that happened to not complete, as it will be odd in the history of the recovered AM to see a map task with a single attempt that ends in _1 or _2 instead of _0. If this goes in we should file a follow-up JIRA to fix recovery so attempts that were "in-flight" when the AM crashed are at least documented in some way on the subsequent AM (e.g.: we mark them as KILLED or something, but at least the user can see what nodes they ran on and what time they were launched).

          There is one thing I'd like to see fixed in the patch. When we're iterating the taskAttempts in the taskInfo and filtering out attempts that didn't complete, we should walk and remove entries using an iterator rather than reaching around and calling remove on the map.

          Show
          Jason Lowe added a comment - This approach looks OK for the short-term. I'm not thrilled about the idea of explicitly losing information on task attempts that happened to not complete, as it will be odd in the history of the recovered AM to see a map task with a single attempt that ends in _1 or _2 instead of _0. If this goes in we should file a follow-up JIRA to fix recovery so attempts that were "in-flight" when the AM crashed are at least documented in some way on the subsequent AM (e.g.: we mark them as KILLED or something, but at least the user can see what nodes they ran on and what time they were launched). There is one thing I'd like to see fixed in the patch. When we're iterating the taskAttempts in the taskInfo and filtering out attempts that didn't complete, we should walk and remove entries using an iterator rather than reaching around and calling remove on the map.
          Hide
          Robert Parker added a comment -

          Thanks for the review Jason. I incorporated the iterator to remove the entries, which is a much better approach.

          Show
          Robert Parker added a comment - Thanks for the review Jason. I incorporated the iterator to remove the entries, which is a much better approach.
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12568916/MAPREDUCE-4992v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3328//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3328//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568916/MAPREDUCE-4992v2.patch against trunk revision . +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . The javadoc tool did not generate any warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core. +1 contrib tests . The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3328//testReport/ Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3328//console This message is automatically generated.
          Hide
          Jason Lowe added a comment -

          +1 lgtm.

          Show
          Jason Lowe added a comment - +1 lgtm.
          Hide
          Jason Lowe added a comment -

          Thanks, Robert! I committed this to trunk, branch-2, and branch-0.23.

          Show
          Jason Lowe added a comment - Thanks, Robert! I committed this to trunk, branch-2, and branch-0.23.
          Hide
          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3350 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3350/)
          MAPREDUCE-4992. AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456)

          Result = SUCCESS
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Show
          Hudson added a comment - Integrated in Hadoop-trunk-Commit #3350 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3350/ ) MAPREDUCE-4992 . AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456) Result = SUCCESS jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #126 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/126/)
          MAPREDUCE-4992. AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456)

          Result = SUCCESS
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Show
          Hudson added a comment - Integrated in Hadoop-Yarn-trunk #126 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/126/ ) MAPREDUCE-4992 . AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456) Result = SUCCESS jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #524 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/524/)
          svn merge -c 1445456 FIXES: MAPREDUCE-4992. AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445458)

          Result = SUCCESS
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445458
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-0.23-Build #524 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/524/ ) svn merge -c 1445456 FIXES: MAPREDUCE-4992 . AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445458) Result = SUCCESS jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445458 Files : /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1315 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1315/)
          MAPREDUCE-4992. AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456)

          Result = FAILURE
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #1315 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1315/ ) MAPREDUCE-4992 . AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456) Result = FAILURE jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1343 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1343/)
          MAPREDUCE-4992. AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456)

          Result = FAILURE
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Show
          Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #1343 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1343/ ) MAPREDUCE-4992 . AM hangs in RecoveryService when recovering tasks with speculative attempts. Contributed by Robert Parker (Revision 1445456) Result = FAILURE jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1445456 Files : /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/recover/RecoveryService.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRecovery.java /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/jobhistory/JobHistoryParser.java
          Hide
          Jason Lowe added a comment -

          This is still occurring in a number of ways:

          • If the task attempt that succeeded was attempt 1 but there is no completion event in the history file for attempt 0, it recovers only attempt 0 but is waiting for attempt 1 to complete.
          • If two task attempts succeed simultaneously it only recovers attempt 0 but is waiting for attempt 1 to complete.
          • If the prior AM attempt was backed up in event processing and launched speculative task attempts after a task attempt completed then it ends up waiting on them but they were never launched.
          Show
          Jason Lowe added a comment - This is still occurring in a number of ways: If the task attempt that succeeded was attempt 1 but there is no completion event in the history file for attempt 0, it recovers only attempt 0 but is waiting for attempt 1 to complete. If two task attempts succeed simultaneously it only recovers attempt 0 but is waiting for attempt 1 to complete. If the prior AM attempt was backed up in event processing and launched speculative task attempts after a task attempt completed then it ends up waiting on them but they were never launched.
          Hide
          Thomas Graves added a comment -

          resolving this as its committed. The additional work is being done under MAPREDUCE-5079

          Show
          Thomas Graves added a comment - resolving this as its committed. The additional work is being done under MAPREDUCE-5079

            People

            • Assignee:
              Robert Parker
              Reporter:
              Robert Parker
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development