  1. Hadoop Common
  2. HADOOP-1278

Fix the per-job tasktracker 'blacklist'


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.12.3
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels: None

    Description

      Today, whenever a tracker is 'lost', all the tasks which ever ran on it are counted as failures against that tracker, and the tracker is added to the job's blacklist. A blacklisted TT is never considered for allocating new tasks unless all tasktrackers are on the list. This results in an ugly situation where a majority of the nodes in the cluster are on the blacklist and hence idle, while the other TTs are maxed out.

      The proposal is two-fold:
      a) Don't count all tasks which ever ran on the TT; count a 'lost' tracker as a 'single' task failure. Each 'lost' tracker then uses up 20% (1 of 5) of the '5 failures == blacklisted' quota.
      b) Stop adding nodes to the blacklist once a certain percentage of the cluster, say 25%, is already on the blacklist. Adding more than that would just delay the inevitable, i.e. there is something horrendously wrong with the cluster, and we might as well fail the job early and noisily.
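
      A minimal sketch of how (a) and (b) could fit together, to make the arithmetic concrete. This is not the attached patch: the class JobBlacklistSketch, its methods and both constants are hypothetical names chosen for illustration, and the quota of 5 and the cap of 25% are simply the figures quoted above. What should happen once the cap is reached (for example failing the job) is left to the caller.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-job tracker blacklist; illustrative only, not Hadoop's actual code. */
final class JobBlacklistSketch {
    // Assumed values, taken from the figures quoted in the description.
    static final int FAILURES_TO_BLACKLIST = 5;   // '5 failures == blacklisted'
    static final int MAX_BLACKLIST_PERCENT = 25;  // "say 25%" of the cluster

    private final int clusterSize;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    JobBlacklistSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /** An ordinary task failure on a tracker counts once. */
    void taskFailed(String tracker) {
        addFailure(tracker);
    }

    /** (a) A lost tracker counts as a single failure, however many tasks ran on it. */
    void trackerLost(String tracker) {
        addFailure(tracker);
    }

    boolean isBlacklisted(String tracker) {
        return blacklist.contains(tracker);
    }

    int failureCount(String tracker) {
        return failures.getOrDefault(tracker, 0);
    }

    private void addFailure(String tracker) {
        int count = failures.merge(tracker, 1, Integer::sum);
        if (count >= FAILURES_TO_BLACKLIST && belowCap()) {
            blacklist.add(tracker);
        }
    }

    /** (b) True while fewer than MAX_BLACKLIST_PERCENT of the trackers are blacklisted. */
    private boolean belowCap() {
        return blacklist.size() * 100 < MAX_BLACKLIST_PERCENT * clusterSize;
    }
}
```

      With a cluster of 8 trackers, for instance, this sketch blacklists at most 2 of them for the job; a third tracker that reaches 5 failures keeps its failure count but stays schedulable.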

      Thoughts?

      Attachments

        1. HADOOP-1278_20070427_1.patch
          7 kB
          Arun Murthy

          People

            Assignee: Arun Murthy (acmurthy)
            Reporter: Arun Murthy (acmurthy)