Hadoop Common / HADOOP-4305

repeatedly blacklisted tasktrackers should get declared dead

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Improved the TaskTracker blacklisting strategy to better exclude faulty trackers from executing tasks.

      Description

      When running a batch of jobs, it often happens that the same tasktrackers are blacklisted again and again. This can slow job execution considerably, in particular when tasks fail because of timeouts.
      It would make sense to no longer assign any tasks to such tasktrackers and to declare them dead.

      1. patch-4305-1.txt
        21 kB
        Amareshwari Sriramadasu
      2. patch-4305-0.18.txt
        21 kB
        Amareshwari Sriramadasu
      3. patch-4305-2.txt
        29 kB
        Amareshwari Sriramadasu
      4. patch-4305-3.txt
        31 kB
        Amareshwari Sriramadasu
      5. patch-4305-4.txt
        37 kB
        Amareshwari Sriramadasu


          Activity

          Robert Chansler added a comment -

          Edit release note for publication.

          Improves the blacklisting strategy so that blacklisted tasktrackers are not given tasks to run from other jobs, subject to the following conditions (all must be met):
          1) The TaskTracker has been blacklisted by at least 4 jobs (configurable via mapred.max.tasktracker.blacklists).
          2) The TaskTracker has been blacklisted 50% more often than the average tracker.
          3) Fewer than 50% of the trackers in the cluster are blacklisted.
          Once in 24 hours, a TaskTracker blacklisted for all jobs is given a chance.
          Restarting the TaskTracker moves it out of the blacklist.
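
          For illustration, here is a minimal sketch of the decision described above. The class, field, and method names are hypothetical (only the config key mapred.max.tasktracker.blacklists comes from the patch); this is not the actual JobTracker code.

              // Sketch only: hypothetical names, not the JobTracker internals from the patch.
              public class BlacklistPolicySketch {

                /**
                 * @param faults           number of jobs that blacklisted this tracker
                 * @param avgFaults        average fault count over the candidate trackers
                 * @param blacklistedCount trackers already blacklisted across all jobs
                 * @param clusterSize      total number of trackers in the cluster
                 * @param maxBlacklists    value of mapred.max.tasktracker.blacklists (default 4)
                 */
                static boolean shouldBlacklistAcrossJobs(int faults, double avgFaults,
                    int blacklistedCount, int clusterSize, int maxBlacklists) {
                  boolean enoughJobsComplained = faults >= maxBlacklists;             // condition 1
                  boolean wellAboveAverage = faults > avgFaults * 1.5;                // condition 2
                  boolean clusterStillHealthy = blacklistedCount < clusterSize * 0.5; // condition 3
                  return enoughJobsComplained && wellAboveAverage && clusterStillHealthy;
                }
              }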

          Hudson added a comment -

          Integrated in Hadoop-trunk #680 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/ )

          Devaraj Das added a comment -

          I just committed this. Thanks Amareshwari!

          Amareshwari Sriramadasu added a comment -

          Added a JUnit testcase to test the blacklisting strategy.

          test-patch result:

               [exec] +1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 7 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          
          Sharad Agarwal added a comment -

          +1 for heartbeat related changes in JT and TT.

          Christian Kunz added a comment -

          We applied the patch to hadoop-0.18 and verified that two repeatedly blacklisted TaskTrackers disappeared from the list of active TaskTrackers:

          2008-12-04 02:46:04,251 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker_xxx to the blackList
          2008-12-04 08:51:09,957 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker_yyy to the blackList

          These two TaskTrackers indeed had disk problems, and the corresponding DataNodes were not active either.

          Great Job!

          Amareshwari Sriramadasu added a comment -

          test-patch result on trunk:

               [exec]
               [exec] -1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
               [exec]                         Please justify why no tests are needed for this patch.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec]
          

          All core and contrib tests passed on my machine.

          It is very difficult to write a JUnit test for this, so I did manual tests.
          Manual tests include:

          1. Verified that a tracker gets blacklisted across all jobs iff
            • #blacklists exceeds mapred.max.tracker.blacklists.
            • #blacklists is 50% above the average #blacklists, over the active and potentially faulty trackers
            • 50% of the cluster is not blacklisted yet
          2. Verified that restarting the tracker makes it a healthy tracker. Verified restart on a healthy tracker, a blacklisted tracker, and a lost tracker.
          3. Verified that a lost task tracker that bounced back accepts tasks again.
          4. Verified that once a tracker is blacklisted, if it is lost and comes back, it is still not healthy.
          Amareshwari Sriramadasu added a comment -

          Attaching a patch incorporating the review comments.

          Devaraj Das added a comment -

          Some comments:
          1. Format the if-condition brackets properly in the incrementFaults method.
          2. You should be able to use the same data structure for both potentiallyFaulty and blacklisted trackers.
          3. Add a comment for mapred.cluster.average.blacklist.threshold noting that it is there solely for tuning purposes; once this feature has been tested on real clusters and an appropriate value for the threshold has been found, this config might be taken out.
          4. Check whether you can remove the initialContact flag and use only the restarted flag in the heartbeat method. This is a more invasive change but might be worthwhile for simplifying the state machine.

          Amareshwari Sriramadasu added a comment -

          The only test failure, org.apache.hadoop.hdfsproxy.TestHdfsProxy.testHdfsProxyInterface, is not related to the patch. All the core and contrib tests passed on my machine.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394532/patch-4305-2.txt
          against trunk revision 720632.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          Patch incorporating review comments.
          Can somebody please review this?

          Devaraj Das added a comment -

          1. Change ConvertTrackerNameToHostName not to include "tracker_" and remove the extra List introduced in JobInProgress.
          2. Faults in PotentiallyFaultyList should also get decremented if there are no faults on that tracker for 24 hours.
          3. Make AVERAGE_BLACKLIST_THRESHOLD configurable, but do not expose it outside (this will give us a way to tune it until we reach the correct number).
          4. Put more comments in code explaining the algorithm.
          5. Change the name of addFaultyTracker to incrementFaults.
          6. ReinitAction on lostTaskTracker should not erase its faults.

          Amareshwari Sriramadasu added a comment -

          Christian, here is a patch for 0.18. Can you validate the patch on your cluster?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12393865/patch-4305-1.txt
          against trunk revision 713893.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/console

          This message is automatically generated.

          Amareshwari Sriramadasu added a comment -

          Here is a patch with the proposed fix.
          The patch does the following:

          • Adds the blacklisted trackers of the job to the potentially faulty list, in JobTracker.finalizeJob()
          • The tracker is moved to blacklisted trackers (across jobs) from potentially faulty list iff
            • #blacklists exceed mapred.max.tracker.blacklists (default value is 4),
            • #blacklists is 50% above the average #blacklists, over the active and potentially faulty trackers
            • 50% of the cluster is not blacklisted yet
          • Restarting the tracker makes it an active tracker
          • After a day, the tracker is given another chance to run tasks
          • Adds #blacklisted_trackers to ClusterStatus
          • Updates web UI to show the blacklisted trackers.
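
          As a usage note on the last two items above: after this patch, a client can read the new count from ClusterStatus. A small sketch, assuming the accessor added to ClusterStatus is named getBlacklistedTrackers() (as in later Hadoop releases); verify the name against your version.

              import java.io.IOException;

              import org.apache.hadoop.mapred.ClusterStatus;
              import org.apache.hadoop.mapred.JobClient;
              import org.apache.hadoop.mapred.JobConf;

              public class BlacklistStatus {
                public static void main(String[] args) throws IOException {
                  JobConf conf = new JobConf();             // picks up the cluster configuration
                  JobClient client = new JobClient(conf);
                  ClusterStatus status = client.getClusterStatus();
                  System.out.println("Task trackers:        " + status.getTaskTrackers());
                  // Assumed accessor name; check your Hadoop version.
                  System.out.println("Blacklisted trackers: " + status.getBlacklistedTrackers());
                }
              }
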
          Amareshwari Sriramadasu added a comment -

          Runping's proposal can be encompassed into Amareshwari's proposal too.....reflect the state of the TT (how many jobs was it running simultaneously) by incrementing the blacklist counter with an appropriate weight.

          I think you meant how many tasks the tracker was running simultaneously at the time of failure. But in steady state, all the slots of the tracker will be occupied; then the blacklist weight would be the same for all the trackers.

          The tracker is blacklisted across all jobs if #blacklists is X% above the average #blacklists, over all the trackers.

          Here, the average value may be very skewed, since very few trackers would be faulty. (In my previous example, it should be X=2500%.)
          To avoid skewness, the tracker can be blacklisted across all jobs if
          1. #blacklists is greater than mapred.max.tasktracker.blacklists and
          2. #blacklists is 50% above the average #blacklists, over all the trackers.

          dhruba borthakur added a comment -

          I like Amareshwari's proposal because it is simple. Owen's extension seems to add an "age-ing" factor to the counter. And Runping's proposal can be encompassed into Amareshwari's proposal too.....reflect the state of the TT (how many jobs was it running simultaneously) by incrementing the blacklist counter with an appropriate weight.

          Christian Kunz added a comment -

          I would be happy with Amareshwari's proposal together with Owen's suggestion. I believe this would help us a lot in our environment.
          To only count blacklisted TaskTrackers for successful jobs seems necessary to avoid false positives because of application issues (although we just had a case where enough bad-behaving TaskTrackers generated enough individual task failures to fail jobs repeatedly). Solutions that cover 80% of issues with 20% of development effort are better than perfect solutions that are never implemented.

          steve_l added a comment -

          I'd be happiest if there was some way of reporting this to some policy component that made the right decision, because the action you take on a managed-VM cluster is different from Hadoop on physical hardware. On physical, you blacklist and maybe trigger a reboot, or you start running well-known health tasks to see which parts of the system appear healthy. On a VM cluster you just delete that node and create a new one; no need to faff around with the state of the VM if it is a task-only VM. If it's also a datanode, you have to decommission it first.

          Devaraj Das added a comment -

          Runping, the case we need to consider is faulty tasks (as Koji had pointed out here - https://issues.apache.org/jira/browse/HADOOP-4305?focusedCommentId=12641556#action_12641556). A tasktracker should not stop asking for tasks if the failures were due to task faults (e.g. buggy user code). So that's why the proposal here is to increment the blacklist-count for a TT only for successful jobs. If a job fails, the blacklist-count is not incremented even if the TT got blacklisted for this job...

          Runping Qi added a comment -

          The number of slots per node is guess-work at best, and it just means that in the normal case the TT can run that many tasks of typical jobs concurrently.
          However, tasks of different jobs may need different resources. Thus, when that many tasks needing large resources run concurrently, some task may fail.
          However, the TT may work fine if fewer tasks run concurrently.
          So in general, you should reduce the maximum allowed concurrent tasks when some tasks fail.
          If the failing continues, that number may eventually reach zero. That is effectively equivalent to being blacklisted.

          After a certain period of time without failures, that number should be incremented gradually.
          This way, the TT will adapt to various load/resource situations autonomously.

          Devaraj Das added a comment -

          Runping, I think taking the average blacklist count on a per-tracker basis, and penalizing only those TTs way above the average, should help even in this scenario. So for example, if a TT is really faulty, its blacklist-count should be way above the average blacklist-count per tracker, and it would be penalized. The other case is where only certain tasks fail due to resource limitations and the TT gets blacklisted through no fault of its own, but IMO in a practical setup this problem would affect many other TTs as well, and hence the average blacklist-count would be a bit higher... Makes sense?

          Amareshwari Sriramadasu added a comment -

          If the TaskTracker is configured to run 10 tasks at a time, and a task fails because of too many concurrent tasks, the problem is with the TaskTracker (its configuration). Right?

          Runping Qi added a comment -

          I still think it is not enough to simply count the number of failed tasks.
          We need to know under what conditions the task failed.
          Consider two cases. One is that a task of a job failed when it was the only task running on a TT.
          Another case is that a task of a job failed when another 10 tasks were running concurrently.
          Case one provides strong evidence that this TT may not be appropriate for the tasks of that job.
          However, case two is much weaker. The task might have succeeded if there were fewer concurrent tasks.

          Devaraj Das added a comment -

          What Owen proposed can be done in addition to what Amareshwari proposed. Specifically, disabling a TT based on the average number of failures helps since
          1) we prevent massive disablement
          2) we penalize only those TTs that are performing badly relative to others in the cluster
          Makes sense?

          Owen O'Malley added a comment -

          One thing that concerns me is that there needs to be a policy for degrading the information across time. How about a policy where we define the max number of blacklists (from successful jobs) you can be on and still get new tasks. Furthermore, once each day the counters are decremented by 1. (To avoid a massive re-enablement, I'd probably use hostname.hashCode() % 24 as the hour to decrement the count for each host.) So if you set the threshold to 4, it would give you:
          1. Any TT blacklisted by 4 (or more) jobs would not get new tasks.
          2. Each day, the JT would forgive one blacklist and likely re-enable the TT.
          3. A re-enabled TT would only get one chance for that day.

          Thoughts?
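
          A rough sketch of the decay policy described above (illustrative only, not code from the patch; all names are made up). Each host gets one "forgiveness hour" per day derived from its hostname, so re-enablement is spread across the day.

              import java.util.Map;
              import java.util.concurrent.ConcurrentHashMap;

              // Illustrative only: not code from the patch.
              public class BlacklistDecaySketch {
                private final Map<String, Integer> blacklistCounts =
                    new ConcurrentHashMap<String, Integer>();

                /** Called once per hour with the current hour of day (0-23). */
                public void decayCounts(int currentHour) {
                  for (Map.Entry<String, Integer> e : blacklistCounts.entrySet()) {
                    // Math.abs guards against negative hash codes.
                    int forgivenessHour = Math.abs(e.getKey().hashCode()) % 24;
                    if (forgivenessHour == currentHour && e.getValue() > 0) {
                      e.setValue(e.getValue() - 1);   // forgive one blacklist per day
                    }
                  }
                }
              }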

          dhruba borthakur added a comment -

          Amareshwari's proposal looks good to me. We should make the TT come back to life after T seconds. Maybe we should issue a re-initialize command to the TT after T seconds and remove it from the blacklist.

          Devaraj Das added a comment -

          Runping, yes, the TT can remember that. But the JT has the overall knowledge about the cluster (TTs and jobs) and is in a better position to decide whether to give a certain task to a TT or not, right?

          Runping Qi added a comment -

          Certainly the TT can remember the jobs of failed tasks and decide whether to refuse tasks of those jobs only or to refuse tasks of all jobs.

          Amareshwari Sriramadasu added a comment -

          The problem with doing it at the TaskTracker is that it does not know whether the application is buggy. So buggy code could bring down all the tasktrackers.

          Another approach is:

          • JT blacklists TTs on per job basis as is today.
          • When a successful job blacklists a tracker, we add the tracker to a potentially-faulty list. For each tracker, the number of jobs that blacklisted it (#blacklists) will be maintained.
          • The tracker is blacklisted across all jobs if #blacklists is X% above the average #blacklists, over all the trackers.
            For example, In cluster with 100 trackers,
            Tracker   #BlackLists
            TT1   10
            TT2   10
            TT3   10
            TT4   7
            TT5   5

            With X=25%, TT1, TT2 and TT3 will be blacklisted across all the jobs.

          • We can reconsider the tracker after time, T, or when it restarts.
          • No more than 50% of the trackers can get blacklisted on the cluster.

          Thoughts?

          Runping Qi added a comment -

          I think the TaskTracker is in a better position to decide whether it can accept more tasks or not.
          If not, whether/when to shut itself down. If yes, what kind of tasks it can accept.

          Many reasons may cause tasks to fail on a node.
          The most common one is a resource limit.
          A task tracker may be OK running one task, but may fail to run more tasks simultaneously.
          A task tracker may be OK running 4 concurrent tasks of one job, but may fail to run 3 concurrent tasks of other jobs.
          The task tracker should accumulate some stats and make decisions based on the stats.

          A simple heuristic may run like this: when a task fails, the task tracker records the number of tasks running concurrently on the tracker at that time.
          If multiple tasks fail at the same concurrency level, the TT should stop asking for new tasks until the concurrency level drops lower.
          After running for a while without task failures, it can bump up the concurrency threshold level again.
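
          For comparison, a hypothetical sketch of this tracker-side heuristic (this is not what the committed patch does; the patch keeps the decision in the JobTracker, and all names below are made up).

              // Hypothetical tracker-side throttle; not part of the HADOOP-4305 patch.
              public class ConcurrencyThrottleSketch {
                private static final long QUIET_PERIOD_MS = 60 * 60 * 1000L; // 1 hour without failures

                private final int configuredSlots;
                private int allowedSlots;            // current cap on concurrent tasks
                private int failuresAtCurrentLevel;
                private long lastFailureTime;

                public ConcurrencyThrottleSketch(int configuredSlots) {
                  this.configuredSlots = configuredSlots;
                  this.allowedSlots = configuredSlots;
                }

                /** Record a task failure that happened while 'running' tasks were concurrent. */
                public synchronized void taskFailed(int running) {
                  lastFailureTime = System.currentTimeMillis();
                  if (running >= allowedSlots) {
                    failuresAtCurrentLevel++;
                    if (failuresAtCurrentLevel >= 2 && allowedSlots > 0) {
                      allowedSlots--;                // stop asking for as many tasks
                      failuresAtCurrentLevel = 0;
                    }
                  }
                }

                /** Gradually restore capacity after a quiet period with no failures. */
                public synchronized void maybeRecover() {
                  if (allowedSlots < configuredSlots
                      && System.currentTimeMillis() - lastFailureTime > QUIET_PERIOD_MS) {
                    allowedSlots++;
                    lastFailureTime = System.currentTimeMillis(); // space out recoveries
                  }
                }

                public synchronized boolean canAcceptNewTask(int running) {
                  return running < allowedSlots;
                }
              }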

          Devaraj Das added a comment -

          I think Dhruba's suggestion makes sense. +1 for that. Vinod's suggestion seems to require users to know about this knob, which I think is not called for at this point.

          Amareshwari Sriramadasu added a comment -

          Can we count this only from successful jobs?

          +1.

          If we adopt a proposal that if a TT fails consecutive tasks from N different jobs, then the tasktracker is shutdown. Any successful execution of any task from any job resets the counter associated with this tasktracker.

          This involves a lot of bookkeeping at the JobTracker, since a counter would have to be maintained for all the trackers, and every task completion and failure would have to update the counters.

          dhruba borthakur added a comment -

          Suppose we adopt a proposal that if a TT fails consecutive tasks from N different jobs, then the tasktracker is shut down. Any successful execution of any task from any job resets the counter associated with this tasktracker. Vinod's proposal seems to be a generalization of this approach.

          Will this simple algorithm work?

          Vinod Kumar Vavilapalli added a comment -

          Can we count this only from successful jobs?

          Even that wouldn't be enough. Blacklisting of a TT by a successful job would only mean that this TT is not suitable for running this job. We can't generalize it to say that this TT is not fit for running any job. The latter can be concluded only by monitoring TT health, which should be done independently of job failures.

          The proposal here doesn't seem to be the right fix. If we are concerned about batch jobs (similar jobs), and about the same jobs being repetitively submitted, we can address the issue by introducing the concept of a batch and by linking batch jobs with something like a 'batch-id'. By default all jobs would belong to the default batch. And then we can consider this batch-id for blacklisting TTs. Thoughts?

          Koji Noguchi added a comment -

          I'm afraid of what would happen when an application has a bug and gets submitted many times to the cluster.
          Could this job blacklist healthy nodes and eventually take the TaskTrackers down?

          If the number of times the task tracker got blacklisted is greater than or equal to mapred.max.tasktracker.blacklists, then the job tracker declares the task tracker as dead.

          Can we count this only from successful jobs?

          dhruba borthakur added a comment -

          I vote for Option 1. If a TaskTracker is found faulty, shut it down. This is similar to the behaviour of a DataNode.

          If an admin restarts a task-tracker, then it should probably join the cluster as a "healthy" task tracker.

          Amareshwari Sriramadasu added a comment -

          Another question: does the restart of a task tracker say it is healthy? Or should we make the admin explicitly say that the task tracker is healthy?

          Irrespective of all this, Map/Reduce should have a utility similar to dfsadmin -refreshNodes, to add and delete trackers in the cluster at any time.

          This could be a separate jira, if it is really required.

          Amareshwari Sriramadasu added a comment - edited

          I propose the following for declaring tasktrackers dead:

          • When to declare a task tracker dead? :
            If the number of times the task tracker got blacklisted is greater than or equal to mapred.max.tasktracker.blacklists, then the job tracker declares the task tracker as dead.
          • What to do with the dead task tracker? :
            • Option 1:
              • send DisallowedTaskTrackerException to the task tracker in the heartbeat. Then task tracker shuts down.
              • Kill the tasks running on the dead tracker and re-execute them. (Similar behavior to a lost task tracker.)
            • Option 2:
              • Honor the tasks running on the dead tracker. But do not schedule any more jobs or tasks on it. (Same behavior as a blacklisted tracker, but across all the jobs.)
          • How to bring back the dead tracker? :
            • With Option 1, restarting the task tracker adds it back to the cluster.
            • With Option 2, There should be an admin command which removes the tracker from deadList.

          I think we should go with Option 1, because it actually makes the tracker die and come back up. Thoughts?

          Irrespective of all this, Map/Reduce should have a utility similar to dfsadmin -refreshNodes, to add and delete trackers in the cluster at any time.

          dhruba borthakur added a comment -

          +1. I have seen that scenario occur a lot in our cluster. Some details in HADOOP-2676.


            People

            • Assignee:
              Amareshwari Sriramadasu
              Reporter:
              Christian Kunz
            • Votes:
              0
              Watchers:
              8
