  1. Hadoop Common
  2. HADOOP-1278

Fix the per-job tasktracker 'blacklist'

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.3
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

      Description

      Today, whenever a tasktracker (TT) is 'lost', all the tasks which ever ran on it are counted as failures against it, and the TT is added to the job's blacklist. A blacklisted TT is never considered for new task allocation unless all tasktrackers are on the list. The result is an ugly situation where a majority of the nodes in the cluster are blacklisted and hence idle, while the remaining TTs are maxed out.

      The proposal is two-fold:
      a) Don't count every task which ever ran on the TT; count a 'lost' tracker as a single task failure. Each 'lost' tracker then uses up 20% of the '5 failures == blacklisted' quota.
      b) Stop adding nodes to the blacklist once a certain percentage of the cluster, say 25%, is already on it. Blacklisting more than that would just delay the inevitable, i.e. there is something horrendously wrong with the cluster, and we might as well fail the job early and noisily.
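
      A minimal sketch of how the two rules could fit together, written for illustration only. The class JobBlacklist and all of its method names are hypothetical and are not taken from the attached patch or from Hadoop's actual code. It assumes a threshold of 5 failures and a 25% cap, as suggested above, and does not model failing the job once the cap is reached.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-job blacklist; names are illustrative, not Hadoop's API. */
class JobBlacklist {
    private final int clusterSize;               // number of tasktrackers in the cluster
    private final int failuresToBlacklist;       // 5 in the issue text
    private final double maxBlacklistedFraction; // 0.25 as suggested in (b)

    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklisted = new HashSet<>();

    JobBlacklist(int clusterSize, int failuresToBlacklist, double maxBlacklistedFraction) {
        this.clusterSize = clusterSize;
        this.failuresToBlacklist = failuresToBlacklist;
        this.maxBlacklistedFraction = maxBlacklistedFraction;
    }

    /** An ordinary task failure on the given tracker. */
    void taskFailed(String tracker) {
        addFailure(tracker);
    }

    /**
     * Proposal (a): a lost tracker counts as one failure,
     * no matter how many of the job's tasks ever ran on it.
     */
    void trackerLost(String tracker) {
        addFailure(tracker);
    }

    private void addFailure(String tracker) {
        int count = failures.merge(tracker, 1, Integer::sum);
        if (count >= failuresToBlacklist && canBlacklistMore()) {
            blacklisted.add(tracker);
        }
    }

    /** Proposal (b): stop growing the blacklist once the cap is reached. */
    private boolean canBlacklistMore() {
        return blacklisted.size() < maxBlacklistedFraction * clusterSize;
    }

    int failureCount(String tracker) {
        return failures.getOrDefault(tracker, 0);
    }

    boolean isBlacklisted(String tracker) {
        return blacklisted.contains(tracker);
    }

    int blacklistSize() {
        return blacklisted.size();
    }
}
```

      With an 8-node cluster, new JobBlacklist(8, 5, 0.25) allows at most two trackers on the list; a third tracker that reaches 5 failures stays schedulable. A real implementation would presumably also decide when to fail the job at that point, which this sketch leaves out.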

      Thoughts?

        Attachments

        1. HADOOP-1278_20070427_1.patch
          7 kB
          Arun C Murthy

            People

            • Assignee: Arun C Murthy (acmurthy)
            • Reporter: Arun C Murthy (acmurthy)