  1. Hadoop Common
  2. HADOOP-1087

Reducer hangs pulling from incorrect file.out.index path (when one of the mapred.local.dir entries is not accessible at map time but becomes available later at reduce time)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.10.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      2007-03-07 23:14:23,431 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Server returned HTTP response code: 500 for URL: http://____:____/mapOutput?map=task_7810_m_000897_0&reduce=397
      at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1149)
      at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:121)
      at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:236)
      at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:199)
      2007-03-07 23:14:23,431 WARN org.apache.hadoop.mapred.TaskRunner: task_7810_r_000397_0 adding host ____.com to penalty box, next contact in 279 seconds

      This happened when one of the drives was full and inaccessible at map time.

      In one mapper, the call in

      public void mergeParts() throws IOException {
      ...
      Path finalIndexFile = mapOutputFile.getOutputIndexFile(getTaskId());

      failed on the first hash entry in mapred.local.dir, so it used the second entry.

      Afterwards, the first directory entry became available, and when the reducer tried to pull the map output through
      public static class MapOutputServlet extends HttpServlet {
      ...
      Path indexFileName = conf.getLocalPath(mapId+"/file.out.index");

      it used the first entry.

      As a result, the directory was empty, and the reducer kept trying to pull from the incorrect path and hung.
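
The mismatch described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the Hadoop source: `resolve` is a hypothetical stand-in for hash-based selection among the mapred.local.dir entries (start at a hash-derived index, take the first usable directory), the directory names `/disk1/mapred/local` and `/disk2/mapred/local` are invented for illustration, and disk availability is simulated with a set rather than a real filesystem.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocalPathMismatchSketch {

    // Hypothetical model of hash-based local directory selection:
    // begin at an index derived from the path's hash and return the
    // first directory that is currently usable.
    static String resolve(List<String> localDirs, Set<String> usableDirs, String relPath) {
        int hash = relPath.hashCode();
        for (int i = 0; i < localDirs.size(); i++) {
            int index = ((hash + i) & Integer.MAX_VALUE) % localDirs.size();
            String dir = localDirs.get(index);
            if (usableDirs.contains(dir)) {
                return dir + "/" + relPath;
            }
        }
        throw new IllegalStateException("no usable local directory for " + relPath);
    }

    // Returns the directory part of a path produced by resolve().
    static String dirOf(String fullPath, String relPath) {
        return fullPath.substring(0, fullPath.length() - relPath.length() - 1);
    }

    public static void main(String[] args) {
        // Assumed directory names, for illustration only.
        List<String> localDirs = Arrays.asList("/disk1/mapred/local", "/disk2/mapred/local");
        String relPath = "task_7810_m_000897_0/file.out.index";

        Set<String> allUsable = new HashSet<>(localDirs);

        // Map time: the hash-preferred directory is full, so the writer falls back.
        String preferredDir = dirOf(resolve(localDirs, allUsable, relPath), relPath);
        Set<String> usableAtMapTime = new HashSet<>(allUsable);
        usableAtMapTime.remove(preferredDir);
        String writtenTo = resolve(localDirs, usableAtMapTime, relPath);

        // Reduce time: every directory is usable again, so the reader
        // resolves to the hash-preferred directory, which holds no index file.
        String lookedUpAt = resolve(localDirs, allUsable, relPath);

        System.out.println("written to:   " + writtenTo);
        System.out.println("looked up at: " + lookedUpAt);
        System.out.println("same path:    " + writtenTo.equals(lookedUpAt));
    }
}
```

In this model the writer and the reader agree only while the set of usable directories stays the same between map time and reduce time. One way to avoid the mismatch would be for the reader to check each configured directory for an existing file rather than recomputing the preferred location, though whether that suits the actual code is not something this report establishes.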

      (I wasn't sure whether this is a duplicate of HADOOP-895, since it is not reproducible unless I get a disk failure.)


          People

            Assignee: Unassigned
            Reporter: Koji Noguchi (knoguchi)