|
[
Permlink
| « Hide
]
dhruba borthakur added a comment - 30/Sep/08 01:14 AM
+1. I have seen that scenario occur a lot in our cluster. Some details in HADOOP-2676.
I propose the following for declaring tasktrackers dead:
I think we should go with Option 1, because it is actually making it die and come up. Thoughts? Irrespective of all this, Map/Reduce should have a utility similar to dfsadmin -refreshNodes , to add and delete trackers to cluster anytime. Another question : Does the restart of a task tarcker say it is healthy? Or should we make admin explicitly say that the task tracker is healthy?
This could be a separate jira, if it is really required. I vote for Option 1. If a TaskTracker is found faulty, shut it down. This is similar to the behaviour of a Datanode.
If an admin restarts a task-tarcker, then it should probably join the cluster as a "healthy" task tracker. I'm afraid what would happen when application has a bug and it gets submitted many times to the cluster.
Could this job blacklist the healthy nodes and eventually take the TaskTrackers down?
Can we count this only from successful jobs? Can we count this only from successful jobs?
The proposal here doesn't seem to be a right fix. If we are concerned about batch jobs(similar jobs), and of same jobs being repetitively submitted, we can addressing the issue by introducing the concept of a batch and by linking batch jobs by something like a 'batch-id'. By default all jobs would belong to the default batch. And then, we can consider this batch-id for blacklisting TTs. Thoughts? If we adopt a proposal that if a TT fails consecutive tasks from N different jobs, then the tasktracker is shutdown. Any successful execution of any task from any job resets the counter associated with this tasktracker. Vinod's proposal seems to be a generalization of this approach.
Will this simple algorithm work?
+1.
This involves a lot of book keeping at JobTracker, since counter should be maintained for all the trackers. And every task completion and failure has to update the counters. I think Dhruba's suggestion makes sense. +1 for that. Vinod's suggestion seems to require users to know about this knob which I think is not called for at this point.
I think TaskTracker is at a better position to decide whether it can accept more tasks or not. Many reasons may cause task failing on a node. A simple heuristics may run like this: When a task fail, the task tracker records the concurrent number of tasks running on the tracker at that time The problem if we do it at the TaskTracker is that it does not know if the application is buggy. So, buggy code can bring down all the tasktrackers.
Another approach is:
Thoughts? certainly the TT can remember the jobs of failed tasks and decide whether to refuse tasks of those jobs only or refuse tasks of all jobs. Runping, yes, the TT can remember that. But the JT has the overall knowledge about the cluster (TTs and jobs) and is in a better position to decide whether to give a certain task to a TT or not, right ?
Amareshwari's proposal looks good to me. We should make the TT come back to life after T seconds. Maybe we should issue a re-initialize command to the TT after T seconds and remove it from the blacklist.
One thing that concerns me, is that there needs to be a policy for degrading the information across time. How about a policy where we define the max number of black lists (from successful jobs) you can be on and still get new tasks. Furthermore, once each day the counters are decremented by 1. (To avoid a massive re-enablement, I'd probably use hostname.hashcode() % 24 as the hour to decrement the count for each host. So if you set the threshold to 4, it would give you:
1. Any TT black listed by 4 (or more) jobs would not get new tasks. 2. Each day, the JT would forgive one blacklist and likely re-enable the TT 3. A re-enabled TT would only get one chance for that day. Thoughts? What Owen proposed can be done in addition to what Amareshwari proposed. Specifically, disabling TT based on the average number of failures helps since
1) we prevent massive disablement 2) we penalize only those TTs that are performing badly relative to others in the cluster Makes sense ? I still think it is not enough to simply count the number of failed tasks. If the TaskTracker is configured to run 10 tasks at a time, and task fails because of too many tasks, the problem is with TaskTracker (its configuration). Right?
Runping, I think taking the average blacklist count on a per tracker basis, and penalizing only those TTs way above the average should help even in this scenario. So for example, if a TT is really faulty, it's blacklist-count should be way above the average number of blacklist-count per tracker, and this would be penalized. The other case is where only certain tasks fail due to resource limitations and the TT gets blacklisted for none of its fault, but IMO in a practical setup, this problem would affect many other TTs as well, and hence the average blacklist-count would be a bit higher... Makes sense?
The number of slots per node is a guess-work at best and it just means in normal case the TT can run that many tasks of typical jobs concurrently. After certain period of time without failures, that number should be incremented gradually. Runping, the case we need to consider is faulty tasks (as Koji had pointed out here - https://issues.apache.org/jira/browse/HADOOP-4305?focusedCommentId=12641556#action_12641556
I'd be happiest if there was some way of reporting this to some policy component that made the right decision. Because the action you take on a managed-VM cluster is different from hadoop on physical. On physical, you blacklist and maybe trigger a reboot. Or you start running well-known health tasks to see which parts of the system appear healthy. On a VM cluster you just delete that node and create a new one -no need to faff around with the state of the VM if it is a task-only VM; if its also a datanode you have to decommission it first.
I would be happy with Amareshwari's proposal together with Owen's suggestion. I believe this would help us a lot in our environment.
To only count blacklisted TaskTrackers for successful jobs seems necessary to avoid false positives because of application issues (although we just had a case where enough bad-behaving TaskTrackers generated enough individual task failures to fail jobs repeatedly). Solutions that cover 80% of issues with 20% of development effort are better than perfect solutions that are never implemented. I like Amareshwati's proposal because it is simple. Owen's extension seems to add a "age-ing" factor to the counter. And, Runping's proposal can be encompassed into Amareshwari's proposal too.....reflect the state of the TT (how many jobs was it running simultaneously) by incrementing the blacklist counter with an appropriate weight.
I think you meant how many tasks the tracker was running simultaneously at the time of failure. But, in steady state all the slots of the tracker will be occupied. then, the blacklist weight would be same for all the trackers.
Here, The average value may be very skewed, since very few trackers would be faulty. (In my previous example, it should be X=2500%) Here is a patch with proposed fix.
The patch does the following:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12393865/patch-4305-1.txt against trunk revision 713893. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3592/testReport/ This message is automatically generated. Christian, here is a patch for 0.18. Can you validate the patch on your cluster?
1. Change ConvertTrackerNameToHostName not to include "tracker_" and remove the extra List introduced in JobInProgress.
2. Faults in PotentiallyFaultyList should also get decremented if there are no faults on that tracker for 24 hours. 3. Make AVERAGE_BLACKLST_THRESHOLD configurable. but do not expose it outside (this will give us a way to tune until we reach the correct number). 4. Put more comments in code explaining the algorithm. 5. change the name of addFaultyTracker to incrementFaults 6. ReinitAction on lostTaskTracker should not erase its faults. Patch incorporating review comments.
Can somebody please review this? -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394532/patch-4305-2.txt against trunk revision 720632. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3646/testReport/ This message is automatically generated. The only test failure org.apache.hadoop.hdfsproxy.TestHdfsProxy.testHdfsProxyInterface is not related to the patch. All the core and contrig tests passed on my machine.
Some comments:
1. Format if condition brackets properly in incrementFaults method 2. You should be able to use the same datastructure for both potentiallyFaulty and blacklisted trackers. 3. Add a comment for mapred.cluster.average.blacklist.threshold that it is there solely for tuning purposes and once this feature has been tested in real clusters and an appropriate value for the threshold has been found, this config might be taken out. 4. Check whether you can remove initialContact flag and use only the restarted flag in the heartbeat method. This is a more serious change but might be worthwhile in simplifying the state machine. Attaching patch for incorporating review comments.
test-patch result on trunk:
[exec]
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
All core and contrib tests passed on my machine. It is very difficult to write junit test for this. Did manual tests.
We applied the patch to hadoop-0.18 and verified that two TaskTrackers repeatedly blacklisted disappeared from the list of active TaskTrackers:
2008-12-04 02:46:04,251 INFO org.apache.hadoop.mapred.JobTracker: Adding tracker_xxxto the blackList These two TaskTrackers indeed had disk problems and the corresponding DataNodes were not active as well. Great Job! +1 for heartbeat related changes in JT and TT.
Added junit testcase to test the blacklisting strategy.
test-patch result: [exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 7 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
I just committed this. Thanks Amareshwari!
Integrated in Hadoop-trunk #680 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/680/
Edit release note for publication.
Improves the blacklisting strategy, whereby, tasktrackers that are blacklisted are not given tasks to run from other jobs, subject to the following conditions (all must be met): |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||