Hadoop Common
HADOOP-1862

reduces are getting stuck trying to find map outputs

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.13.0, 0.13.1, 0.14.0, 0.14.1
    • Fix Version/s: 0.14.2
    • Component/s: None
    • Labels: None

      Description

      Some of the reduces have been stuck for hours looking for 137 map outputs. When I look at the job events, all 2600 of the maps have succeeded. There have been lots of lost task trackers and shuffle failures. The maps have been run between 1 and 6 times each. I do see that some of the events in the task event log are marked OBSOLETE.

      Attachments

      1. HADOOP-1862_prelim.patch
        3 kB
        Arun C Murthy
      2. HADOOP-1862_debug.patch
        4 kB
        Arun C Murthy
      3. HADOOP-1862_1_20070912.patch
        17 kB
        Arun C Murthy
      4. HADOOP-1862_2_20070927.patch
        17 kB
        Arun C Murthy
      5. HADOOP-1862_3_20071002.patch
        17 kB
        Arun C Murthy
      6. HADOOP-1862_branch14_1_20071003.patch
        17 kB
        Arun C Murthy


          Activity

          Devaraj Das added a comment -

          I just committed this. Thanks, Arun!

          Hudson added a comment -

          Integrated in Hadoop-Nightly #258 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/258/ )
          Arun C Murthy added a comment -

          Attached patch for branch-0.14

          Devaraj Das added a comment -

          Reopening the issue since the patch doesn't apply to 0.14. Please submit a separate patch for that.

          Devaraj Das added a comment -

          I just committed this. Thanks, Arun!

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12366927/HADOOP-1862_3_20071002.patch
          against trunk revision r581101.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests +1. The patch passed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/865/console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Updated patch; it also resets the 'size' variable in MapOutputCopier.run on fetch failures.
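
          A minimal hypothetical sketch of this kind of reset (the names below are illustrative, not the actual MapOutputCopier fields):

          // Hypothetical sketch: reset the per-copy byte counter when a fetch fails,
          // so an aborted copy does not leave a stale count behind.
          public class CopierSketch {
              private long size = 0;  // bytes copied for the current map output

              void copyOutput(String mapOutputUrl) {
                  try {
                      size = fetch(mapOutputUrl);   // may fail, e.g. on a lost tasktracker
                  } catch (Exception e) {
                      size = 0;                     // reset on fetch failure
                      // ... report the failed fetch so the output can be re-fetched later
                  }
              }

              private long fetch(String url) throws Exception {
                  throw new UnsupportedOperationException("placeholder for the real HTTP copy");
              }
          }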

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12366638/HADOOP-1862_2_20070927.patch
          against trunk revision r580461.

          @author +1. The patch does not contain any @author tags.

          javadoc +1. The javadoc tool did not generate any warning messages.

          javac +1. The applied patch does not generate any new compiler warnings.

          findbugs +1. The patch does not introduce any new Findbugs warnings.

          core tests +1. The patch passed core unit tests.

          contrib tests -1. The patch failed contrib unit tests.

          Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/testReport/
          Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/845/console

          This message is automatically generated.

          Devaraj Das added a comment -

          The patch looks good.

          Devaraj Das added a comment -

          Submitting the patch on behalf of Arun to get it through Hudson.

          Arun C Murthy added a comment -

          Here is a well-tested patch: it works with and without speculative execution, lots of lost tasktrackers, etc.

          However, I want to wait for Devaraj to run a few tests with this and HADOOP-1874 before we get this in... hence I'm not marking it PA (Patch Available) right away. Clearly any reviews/ideas on this are welcome.

          Devaraj Das added a comment -

          Sorry, I am not convinced that the patch solves the issue at hand. I would like to wait for the problem to be replicated with a version of Hadoop that includes the patch for HADOOP-1874. Then I would like to see whether we still run into the situation reported by this bug, apply this patch on that version, and see whether it solves the problem. This is a significant change, and I just want to make sure that this patch really addresses the reported bug.

          Arun C Murthy added a comment -

          Here is a first take on this one, with very preliminary testing done.

          I know Devaraj has some second thoughts, but still... :)

          Arun C Murthy added a comment -

          Ok, here is the proposal to fix this one:

          a) Maintain a straight cache at the TaskTracker, i.e. a straight copy of the events sent by the JobTracker, and don't manipulate them.
          b) Process OBSOLETE TaskCompletionEvents at the ReduceTask (a rough sketch of this follows the list).
          c) Keep my earlier fix to TaskInProgress.completedTask, i.e. do not add a SUCCESS event until it is clear that this is the true successful event and that the TIP hadn't already been completed by some other task attempt.
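
          A rough sketch of point (b), with assumed names rather than the actual patch code:

          import java.util.HashMap;
          import java.util.Map;

          // Illustrative only: how a reduce-side handler could honour OBSOLETE/FAILED events.
          public class ReduceEventSketch {
              enum Status { SUCCEEDED, FAILED, OBSOLETE }

              // Map-output locations the reducer still plans to fetch, keyed by task-attempt id.
              private final Map<String, String> scheduledCopies = new HashMap<String, String>();

              void processCompletionEvent(String attemptId, Status status, String outputLocation) {
                  switch (status) {
                  case SUCCEEDED:
                      // Schedule a fetch from the attempt that actually produced output.
                      scheduledCopies.put(attemptId, outputLocation);
                      break;
                  case OBSOLETE:
                  case FAILED:
                      // Drop any copy scheduled from an attempt the JobTracker has invalidated,
                      // instead of retrying it forever.
                      scheduledCopies.remove(attemptId);
                      break;
                  }
              }
          }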

          Owen O'Malley added a comment -

          Ok, I just found this one and it is a bad one. The problem is that the TaskTracker in fetchMapCompletionEvents stores a cache of the completion events indexed by TIP id instead of TASK id. So there is an event race condition between tasks in the same TIP, and if the last event is the failed one, then the reduces get stuck, because that map is marked as never completed. I've marked this as a 0.14.2 fix, but we might need a 0.13 fix too.
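
          A small self-contained illustration of the race (class and field names here are made up, not the real TaskTracker code): keying the cache by TIP id lets a later FAILED attempt overwrite the SUCCEEDED one, while keying by task-attempt id preserves it.

          import java.util.HashMap;
          import java.util.Map;

          public class CompletionEventCacheSketch {
              enum Status { SUCCEEDED, FAILED, OBSOLETE }

              // Buggy variant: keyed by TIP id, so attempts of the same TIP overwrite each other.
              static final Map<String, Status> byTipId = new HashMap<String, Status>();
              // Fixed variant: keyed by task-attempt id, so each attempt keeps its own event.
              static final Map<String, Status> byAttemptId = new HashMap<String, Status>();

              static void record(String tipId, String attemptId, Status status) {
                  byTipId.put(tipId, status);          // last event wins; a SUCCEEDED can be lost
                  byAttemptId.put(attemptId, status);  // SUCCEEDED survives a later FAILED attempt
              }

              public static void main(String[] args) {
                  // Attempt _1 succeeds, then a speculative attempt _2 fails and reports later.
                  record("tip_m_001149", "attempt_m_001149_1", Status.SUCCEEDED);
                  record("tip_m_001149", "attempt_m_001149_2", Status.FAILED);

                  // The TIP-keyed cache now claims the map never completed, so the reduce stays stuck.
                  System.out.println("by TIP id:     " + byTipId.get("tip_m_001149"));            // FAILED
                  System.out.println("by attempt _1: " + byAttemptId.get("attempt_m_001149_1"));  // SUCCEEDED
              }
          }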

          Arun C Murthy added a comment -

          Ok, I was fortunate to be able to look at an actual job exhibiting this same behaviour (thanks to Christian and HADOOP-1874) and here are some insights:

          Essentially, there seems to be an issue with the TaskCompletionEvents received by the ReduceTask. More than one stuck reducer (looking for map outputs) never actually scheduled a copy from the maps from which it is missing outputs, which leads me to believe that there is an issue with missing TaskCompletionEvents.

          The other minor bug is the one I pointed out in my previous comment, so I've attached another patch which incorporates:
          a) the previous fix
          b) a debug statement to help track the TaskCompletionEvents received at the Reduce task (roughly of the form sketched below).
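
          A debug line of the kind described in (b) might look roughly like this, using commons-logging as Hadoop does (hypothetical wording, not the patch's exact statement):

          import org.apache.commons.logging.Log;
          import org.apache.commons.logging.LogFactory;

          // Hypothetical trace of completion events as the reduce task receives them.
          public class EventTraceSketch {
              private static final Log LOG = LogFactory.getLog(EventTraceSketch.class);

              static void trace(String status, String attemptId, String location) {
                  LOG.debug("Got TaskCompletionEvent: " + status + " " + attemptId + " " + location);
              }
          }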

          Christian: I'd appreciate it if you could try this out... Thanks!

          Arun C Murthy added a comment -

          Illustrative fix...

          Arun C Murthy added a comment -

          Hmm... one straw to clutch:

          
          $ cat 1862-event.log | grep task_200709041519_0023_m_001149
          OBSOLETE task_200709041519_0023_m_001149_0 http://a.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_0
          FAILED task_200709041519_0023_m_001149_0 null
          SUCCEEDED task_200709041519_0023_m_001149_1 http://b.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_1
          SUCCEEDED task_200709041519_0023_m_001149_2 http://c.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001149_2
          
          
          $ cat 1862-event.log | grep task_200709041519_0023_m_001816
          OBSOLETE task_200709041519_0023_m_001816_0 http://x.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_0
          FAILED task_200709041519_0023_m_001816_0 null
          SUCCEEDED task_200709041519_0023_m_001816_1 http://y.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_1
          SUCCEEDED task_200709041519_0023_m_001816_2 http://z.a.com:50060/tasklog?plaintext=true&taskid=task_200709041519_0023_m_001816_2
          
          
          

          Essentially, in JobInProgress.updateTaskStatuses(TaskInProgress, TaskStatus, JobTrackerMetrics) the TaskCompletionEvent.Status.SUCCEEDED event is added irrespective of whether the TIP is already complete or not, leading to each reducer seeing two SUCCEEDED events as above... clearly the fetch from one of them will fail, since either _1 or _2 will be KILLED; not a happy situation.

          Like I said, I'll try to dig deeper; maybe this could help someone beat me to it. :)
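
          An illustrative guard for the duplicate-SUCCEEDED scenario above (hypothetical names, not the real JobInProgress internals): only the first attempt to complete a TIP produces a SUCCEEDED completion event.

          import java.util.ArrayList;
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          public class CompletionEventGuardSketch {
              private final Set<String> completedTips = new HashSet<String>();
              private final List<String> taskCompletionEvents = new ArrayList<String>();

              void taskSucceeded(String tipId, String attemptId) {
                  // Later attempts of an already-complete TIP (e.g. speculative ones that will
                  // be KILLED) do not get a SUCCEEDED event published to the reducers.
                  if (completedTips.add(tipId)) {
                      taskCompletionEvents.add("SUCCEEDED " + attemptId);
                  }
              }
          }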


            People

            • Assignee: Arun C Murthy
            • Reporter: Owen O'Malley
            • Votes: 0
            • Watchers: 2
