Hadoop Common
HADOOP-343

In case of a dead task tracker, the map output copiers keep trying to copy all map outputs from this tasktracker

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.2
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels: None

      Description

      In case of a dead task tracker, the reduces which do not have the updated map output locations try copying files from this node, and since these copies fail, this leads to backoff and slowing down of the copy phase.

      Attachments

      1. bugfix.patch
         2 kB
         Mahadev konar
      2. cache-purge.txt
         1 kB
         Sameer Paranjpye

          Activity

          Mahadev konar added a comment -

          This patch makes the reduce task update its map output locations in case of any failure copying from a task tracker. So, if a copy from a task tracker fails, the map outputs corresponding to that node will be marked stale. The ReduceTask will then ask the job tracker again for the locations of these stale map outputs. This patch also fixes a bug wherein the tasks keep polling the job tracker for map outputs in a loop without sleeping/waiting. With this fix the tasks will wait MIN_POLL_INTERVAL before querying the job tracker again.
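
          As a rough illustration of the two behaviours described above (not the attached patch itself), the sketch below marks a host's cached map output locations stale after a failed copy and enforces a minimum interval between job tracker polls. All names here (MapOutputLocation, ShuffleSketch, markHostStale) and the MIN_POLL_INTERVAL value are hypothetical, not the Hadoop 0.7 API.

          import java.util.ArrayList;
          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          // Hypothetical record of where one map task's output can be fetched from.
          class MapOutputLocation {
              final String host;
              final int mapId;
              boolean stale;                      // set when a copy from this host fails

              MapOutputLocation(String host, int mapId) {
                  this.host = host;
                  this.mapId = mapId;
              }
          }

          class ShuffleSketch {
              // Assumed minimum interval between job tracker polls, in milliseconds.
              static final long MIN_POLL_INTERVAL = 5000;

              private final Map<Integer, MapOutputLocation> knownLocations = new HashMap<>();
              private long lastPollTime = 0;

              // Mark every cached location on the failed host as stale so it is
              // re-requested from the job tracker instead of being retried blindly.
              void markHostStale(String failedHost) {
                  for (MapOutputLocation loc : knownLocations.values()) {
                      if (loc.host.equals(failedHost)) {
                          loc.stale = true;
                      }
                  }
              }

              // Collect the stale map ids to ask the job tracker about, but never
              // poll more often than MIN_POLL_INTERVAL (the bug fixed here was
              // polling in a tight loop without waiting).
              List<Integer> staleMapsToQuery() throws InterruptedException {
                  long sinceLastPoll = System.currentTimeMillis() - lastPollTime;
                  if (sinceLastPoll < MIN_POLL_INTERVAL) {
                      Thread.sleep(MIN_POLL_INTERVAL - sinceLastPoll);
                  }
                  lastPollTime = System.currentTimeMillis();

                  List<Integer> stale = new ArrayList<>();
                  for (MapOutputLocation loc : knownLocations.values()) {
                      if (loc.stale) {
                          stale.add(loc.mapId);
                      }
                  }
                  return stale;
              }
          }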

          Owen O'Malley added a comment -

          I have concerns about this patch. It might have unintended consequences. In particular, if a node is slow you'll drop what you know about that node and put more load on the job tracker. This patch is certainly addressing a real problem, though.

          eric baldeschwieler added a comment -

          Don't forget HADOOP-248. This could be implemented in such a way that, once a single task fails to reach the job, all task trackers are notified of the failure.

          Sameer Paranjpye added a comment -

          I think this needs further attention. The patch is probably out of date at this point, but the problem is real. I think this may also be responsible for the 'long pause' at the end of the shuffle.

          If a tasktracker fails, its map outputs are lost. However, the other tasktrackers are unaware of this. Towards the end of the shuffle they have all map output locations cached, and they keep trying to pull data from the lost tasktracker, one file at a time. Every one of these file transfers fails, and each failed transfer also causes the tasktrackers to back off from pulling their remaining outputs. The cumulative effect of all the backoffs is the long pause.
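
          For a rough sense of how those per-failure backoffs add up into the 'long pause', here is a back-of-the-envelope sketch; the location count, retry limit, and backoff value are made-up example numbers, not measurements from this issue.

          // Illustration only: worst-case delay from retrying a single dead host.
          public class BackoffEstimate {
              public static void main(String[] args) {
                  int cachedLocationsFromDeadHost = 4;   // typically only a few per host
                  int retriesPerLocation = 3;            // assumed retry limit
                  long backoffPerFailureMs = 10_000;     // assumed backoff after each failure

                  long totalPauseMs = (long) cachedLocationsFromDeadHost
                          * retriesPerLocation * backoffPerFailureMs;
                  System.out.println("Worst-case added shuffle delay: " + totalPauseMs + " ms");
              }
          }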

          Sameer Paranjpye added a comment -

          This patch addresses the concern raised. If a map output transfer from a particular tasktracker fails, the other output locations from that tasktracker that are present in the cache are removed. This addresses the problem of repeated attempts and backoffs against a lost tasktracker, which is particularly bad towards the end of a shuffle. Copies can, of course, fail for other reasons; in those cases output locations are also removed. The cost of this removal is fairly low, because the number of output locations cached for a specific tasktracker is usually small (3-4), and removing them (even multiple times) results in only a handful of extra polls of the jobtracker.
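
          A minimal sketch of the purge behaviour described above, with hypothetical names (this is not the code in cache-purge.txt): when one transfer from a host fails, every cached location for that host is dropped so it has to be re-fetched from the jobtracker.

          import java.util.HashMap;
          import java.util.Iterator;
          import java.util.Map;

          // Hypothetical cache mapping a map task id to the host serving its output.
          class OutputLocationCache {
              private final Map<Integer, String> locations = new HashMap<>();

              void add(int mapTaskId, String host) {
                  locations.put(mapTaskId, host);
              }

              // Called after a copy from 'failedHost' fails: drop every cached entry
              // for that host. The handful of dropped entries (usually 3-4) are
              // re-requested from the job tracker on the next poll, which is cheap.
              int purgeHost(String failedHost) {
                  int removed = 0;
                  Iterator<Map.Entry<Integer, String>> it = locations.entrySet().iterator();
                  while (it.hasNext()) {
                      if (it.next().getValue().equals(failedHost)) {
                          it.remove();
                          removed++;
                      }
                  }
                  return removed;
              }
          }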

          Doug Cutting added a comment -

          I just committed this. Thanks, Sameer!


            People

            • Assignee: Sameer Paranjpye
            • Reporter: Mahadev konar
            • Votes: 0
            • Watchers: 0
