  1. Hadoop Map/Reduce
  2. MAPREDUCE-4852

Reducer should not signal fetch failures for disk errors on the reducer's side


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: mrv2
    • Labels: None

    Description

      Ran across a case where a reducer ran on a node where the disks were full, leading to an exception like this during the shuffle fetch:

      2012-12-05 09:07:28,749 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.MergeManager: attempt_1352354913026_138167_m_000654_0: Shuffling to disk since 235056188 is greater than maxSingleShuffleLimit (155104064)
      2012-12-05 09:07:28,755 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#25 failed to read map headerattempt_1352354913026_138167_m_000654_0 decomp: 235056188, 101587629
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1352354913026_138167_r_000189_0/map_654.out
      	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
      	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
      	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
      	at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
      	at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:81)
      	at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:245)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:348)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:283)
      	at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:155)
      2012-12-05 09:07:28,755 WARN [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for tasks [attempt_1352354913026_138167_m_000654_0]
      2012-12-05 09:07:28,756 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Reporting fetch failure for attempt_1352354913026_138167_m_000654_0 to jobtracker.
      

      Even though the error was local to the reducer, the reducer reported it to the AM as a fetch failure rather than failing itself. It then hit the same error for many other maps, causing those maps to be relaunched because of the reported fetch failures. In this case it would have been better to fail the reducer and retry it on another node rather than blame the mapper for what is an error on the reducer's side.
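
      The distinction argued for above can be sketched as follows. This is only an illustration of the idea, not the change that was made in Hadoop: the class FetchErrorClassifier, the enum Blame, the method names, and the LocalDiskException stand-in (used here in place of org.apache.hadoop.util.DiskChecker.DiskErrorException so the sketch has no Hadoop dependency) are all hypothetical.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch: decide who is to blame for a failed map-output copy. */
final class FetchErrorClassifier {

    /** Which side the failure is attributed to. */
    enum Blame { REDUCER_LOCAL, MAP_HOST }

    /** Stand-in for a reducer-side disk error such as DiskErrorException. */
    static class LocalDiskException extends IOException {
        LocalDiskException(String message) { super(message); }
    }

    /** Walks the cause chain; any local disk error means the reducer is at fault. */
    static Blame classify(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof LocalDiskException) {
                return Blame.REDUCER_LOCAL;
            }
        }
        return Blame.MAP_HOST;
    }

    /** Returns the action a fetcher could take for the given failure. */
    static String handle(String mapAttemptId, IOException error) {
        if (classify(error) == Blame.REDUCER_LOCAL) {
            // Fail this reduce attempt so it can be retried on another node.
            return "FAIL_REDUCER";
        }
        // Otherwise treat it as a genuine fetch failure against the map output.
        return "REPORT_FETCH_FAILURE " + mapAttemptId;
    }

    public static void main(String[] args) {
        String mapId = "attempt_1352354913026_138167_m_000654_0";
        IOException disk =
            new LocalDiskException("Could not find any valid local directory");
        IOException network = new SocketTimeoutException("Read timed out");
        System.out.println(handle(mapId, disk));
        System.out.println(handle(mapId, network));
    }
}
```

      With a split like this, a full local disk would fail one reduce attempt instead of triggering fetch-failure reports against many healthy map outputs.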


          People

            Assignee: Unassigned
            Reporter: Jason Darrell Lowe (jlowe)
            Votes: 0
            Watchers: 3
