Hadoop Map/Reduce / MAPREDUCE-3184

Improve handling of fetch failures when a tasktracker is not responding on HTTP

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.205.0
    • Fix Version/s: 1.0.1
    • Component/s: jobtracker
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed
    • Release Note:
      The TaskTracker now has a thread which monitors for a known Jetty bug in which the selector thread starts spinning and map output can no longer be served. If the bug is detected, the TaskTracker will shut itself down. This feature can be disabled by setting mapred.tasktracker.jetty.cpu.check.enabled to false.

      Description

      On a 100-node cluster, we had an issue where one of the TaskTrackers (TTs) was hit by MAPREDUCE-2386 and stopped responding to fetches. We observed the following behavior:

      • Every reducer would try to fetch the output of the same map task, and fail after ~13 minutes.
      • At that point, all reducers would report the failed fetch for that task to the JobTracker (JT), and the task would be re-run.
      • Meanwhile, the reducers would move on to the next map task that had run on the same TT, and hang for another 13 minutes.

      The job essentially made no progress for hours, because the map tasks that had run on the bad node were marked failed one at a time.

      To combat this issue, we should introduce a second type of failed-fetch notification, used when the TT does not respond at all (e.g., a SocketTimeoutException). These notifications should count against the TT as a whole rather than against a single task. If more than half of the reducers report such a failure for a given TT, then all of the tasks from that TT should be re-run.
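
      The threshold rule proposed above can be sketched roughly as follows. This is only an illustration of the idea, not the attached patch (per the release note, the committed change monitors for the Jetty bug instead). The class name TrackerFetchFailureMonitor, the method reportUnresponsive, and the string identifiers for trackers and reducers are all hypothetical and do not come from the Hadoop code base.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of tracker-level fetch-failure accounting.
 * None of these names exist in Hadoop; they are assumptions for illustration.
 */
class TrackerFetchFailureMonitor {
    private final int totalReducers;

    // For each TaskTracker, the distinct reducers that reported it as unreachable.
    private final Map<String, Set<String>> reportersByTracker = new HashMap<>();

    TrackerFetchFailureMonitor(int totalReducers) {
        if (totalReducers <= 0) {
            throw new IllegalArgumentException("totalReducers must be positive");
        }
        this.totalReducers = totalReducers;
    }

    /**
     * Records that a reducer could not reach the tracker at all
     * (for example, because of a socket timeout).
     *
     * @return true once more than half of the reducers have reported this
     *         tracker, i.e. when all map tasks that ran on it should be re-run
     */
    boolean reportUnresponsive(String trackerName, String reducerId) {
        Set<String> reporters =
            reportersByTracker.computeIfAbsent(trackerName, k -> new HashSet<>());
        // A set is used so that repeated reports from one reducer count once.
        reporters.add(reducerId);
        return reporters.size() * 2 > totalReducers;
    }
}
```

      Counting distinct reducers per tracker, rather than raw reports, keeps a single retrying reducer from tripping the threshold on its own. With 4 reducers, for instance, the method in this sketch returns true only when a third distinct reducer reports the same tracker.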

      1. mr-3184.txt (13 kB, Todd Lipcon)

        Issue Links

          This issue is related to MAPREDUCE-5588

          Activity

          Todd Lipcon created issue
          Todd Lipcon made changes
            Field: Original Value → New Value
            Attachment: (none) → mr-3184.txt [ 12499475 ]
          Todd Lipcon made changes
            Assignee: (none) → Todd Lipcon [ tlipcon ]
          Todd Lipcon made changes
            Status: Open [ 1 ] → Resolved [ 5 ]
            Hadoop Flags: (none) → Reviewed [ 10343 ]
            Release Note: (none) → The TaskTracker now has a thread which monitors for a known Jetty bug in which the selector thread starts spinning and map output can no longer be served. If the bug is detected, the TaskTracker will shut itself down. This feature can be disabled by setting mapred.tasktracker.jetty.cpu.check.enabled to false.
            Fix Version/s: (none) → 0.20.206.0 [ 12317960 ]
            Resolution: (none) → Fixed [ 1 ]
          Matt Foley made changes
            Fix Version/s: (none) → 1.0.1 [ 12319503 ]
            Fix Version/s: 1.1.0 [ 12317960 ] → (none)
            Target Version/s: 1.1.0 [ 12317960 ] → 1.0.1 [ 12319503 ]
          Matt Foley made changes
            Status: Resolved [ 5 ] → Closed [ 6 ]
          Chris Nauroth made changes
            Link: (none) → This issue is related to MAPREDUCE-5588 [ MAPREDUCE-5588 ]
          Jordan Zimmerman made changes
            Assignee: Todd Lipcon [ tlipcon ] → Jordan Zimmerman [ randgalt ]
          Kihwal Lee made changes
            Assignee: Jordan Zimmerman [ randgalt ] → Todd Lipcon [ tlipcon ]

            People

            • Assignee:
              Todd Lipcon
              Reporter:
              Todd Lipcon
            • Votes:
              0
              Watchers:
              11
