Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: jobtracker
    • Labels:
      None

      Description

      The current heuristic of rolling up a fixed number of job failures per tracker isn't working well; we need a better design and better heuristics.

        Issue Links

          Activity

          Ahmed Radwan added a comment -

          This change seems to have already been pushed to branch-1.0; I think we can close this ticket.

          commit 2670ab9a04634773659a471f42d9431ef71944cc
          Author: Owen O'Malley <omalley@apache.org>
          Date:   Fri Mar 4 04:33:31 2011 +0000
          
              commit 6939f6854b330a01cc4427f4c657df0c3c4d53ab
              Author: Arun C Murthy <acmurthy@apache.org>
              Date:   Fri Jul 23 15:39:49 2010 -0700
          
                  MAPREDUCE-1966. Change blacklisting of tasktrackers on task failures to be a simple graylist to fingerpoint bad tasktrackers. Contributed by Gre$
          
                  +++ b/YAHOO-CHANGES.txt
                  +    MAPREDUCE-1966. Change blacklisting of tasktrackers on task failures to be
                  +    a simple graylist to fingerpoint bad tasktrackers. (Greg Roelofs via
                  +    acmurthy)
                  +
          
          
              git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-patches@1077596 13f79535-47bb-0310-9956-ffa450edef68
          
          Jonathan Eagles made changes -
          Link This issue is cloned as MAPREDUCE-2526 [ MAPREDUCE-2526 ]
          Jonathan Eagles made changes -
          Link This issue relates to MAPREDUCE-2490 [ MAPREDUCE-2490 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAPREDUCE-2231 [ MAPREDUCE-2231 ]
          Greg Roelofs made changes -
          Greg Roelofs added a comment -

          "Reasonably final" patch, pending review, test-patch, etc. I'm firing off tests in a few minutes.

          Greg Roelofs made changes -
          Attachment MR-1966.v1.trunk-hadoop-mapreduce.patch [ 12452483 ]
          Greg Roelofs added a comment -

          Initial patch; only minimal testing so far.

          Still working on TestTaskTrackerBlacklisting.java, which requires some care.

          Greg Roelofs made changes -
          Field Original Value New Value
          Assignee Greg Roelofs [ roelofs ]
          Greg Roelofs added a comment -

          There's an ambiguity between sick nodes (typically due to failing hardware: a hard drive, memory, or occasionally a NIC or network switch) and nodes that have been rendered unresponsive by user abuse. The existing blacklist heuristics touch on this, but they're a bit ad hoc, and there's not much visibility into the internal state at any given time.

          One improvement would be to track the per-node, per-job blacklisting history in a sliding window that's divided into buckets of some suitable granularity. Bad hardware would tend to show up as an elevated fault level on one node (or a few nodes) for an extended period (i.e., multiple buckets), while abusive jobs would tend to show up as a spike (ideally) or at least a limited-duration jump in faults (one or a few buckets) across many nodes.

          Because the heuristics are open to argument even among experts (which would not include me), and because automatic, hardcoded blacklisting has the potential to wipe out a good fraction of a cluster for the wrong reasons, it would seem best to convert the heuristic form of blacklisting to an advisory mode (i.e., "graylisting") until the behavior is better understood.
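
          The bucketed sliding window described above can be sketched roughly as follows. This is an illustration only, not the code from the attached patch: the class name FaultWindow, its methods, and the bucket parameters are all hypothetical, it tracks faults per node only (not per job), and it assumes non-negative timestamps that arrive in roughly increasing order.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-node fault history kept in a fixed-size ring of time buckets. */
final class FaultWindow {
    private final int numBuckets;     // number of buckets in the sliding window
    private final long bucketMillis;  // width of one bucket in milliseconds
    // For each node: the absolute bucket id currently stored in each ring slot.
    private final Map<String, long[]> bucketIds = new HashMap<>();
    // For each node: the fault count for the bucket stored in each ring slot.
    private final Map<String, int[]> counts = new HashMap<>();

    FaultWindow(int numBuckets, long bucketMillis) {
        this.numBuckets = numBuckets;
        this.bucketMillis = bucketMillis;
    }

    /** Records one fault on the given node at the given time. */
    void recordFault(String node, long timeMillis) {
        long bucket = timeMillis / bucketMillis;
        int slot = (int) (bucket % numBuckets);
        long[] ids = bucketIds.computeIfAbsent(node, n -> {
            long[] fresh = new long[numBuckets];
            Arrays.fill(fresh, -1L); // -1 marks a slot that was never used
            return fresh;
        });
        int[] c = counts.computeIfAbsent(node, n -> new int[numBuckets]);
        if (ids[slot] != bucket) { // slot still holds an older bucket: recycle it
            ids[slot] = bucket;
            c[slot] = 0;
        }
        c[slot]++;
    }

    /** Total faults on the node within the window ending at nowMillis. */
    int faultsInWindow(String node, long nowMillis) {
        return scan(node, nowMillis, false);
    }

    /** Number of distinct buckets in the window with at least one fault. */
    int activeBuckets(String node, long nowMillis) {
        return scan(node, nowMillis, true);
    }

    private int scan(String node, long nowMillis, boolean countBuckets) {
        long[] ids = bucketIds.get(node);
        if (ids == null) {
            return 0; // no faults ever recorded for this node
        }
        int[] c = counts.get(node);
        long newest = nowMillis / bucketMillis;
        long oldest = newest - numBuckets + 1;
        int total = 0;
        for (int i = 0; i < numBuckets; i++) {
            if (ids[i] >= oldest && ids[i] <= newest && c[i] > 0) {
                total += countBuckets ? 1 : c[i];
            }
        }
        return total;
    }
}
```

          Under these assumptions, a node whose activeBuckets value stays high over a long window would be a candidate for the "bad hardware" pattern, whereas many nodes showing faults in the same one or two buckets would look more like the "abusive job" pattern. Where to set the thresholds is exactly the open question, which is why an advisory graylist seems safer than automatic blacklisting.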

          Arun C Murthy created issue -

            People

            • Assignee:
              Greg Roelofs
              Reporter:
              Arun C Murthy
            • Votes:
              0
              Watchers:
              17

              Dates

              • Created:
                Updated:

                Development