Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 0.12.3
Description
Today, whenever a tracker is 'lost', all the tasks which ever ran on it are counted as failures and the tracker is added to the blacklist, which ensures that the particular TT is never considered for allocating new tasks unless all tasktrackers are on the list. This results in an ugly situation where a majority of nodes in the cluster are on the blacklist and hence idle, while the other TTs are maxed out.
The proposal is two-fold:
a) Don't count all the tasks which ever ran on the TT; count a 'lost' tracker as a 'single' task failure instead, which means that each 'lost' tracker uses up 20% of the '5 failures == blacklisted' quota.
b) Stop adding nodes to the blacklist when a certain percentage of the cluster, say 25%, is already on the blacklist. Adding more than that would just delay the inevitable, i.e. there is something horrendously wrong with the cluster, and we might as well fail the job early and noisily.
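To make the two parts concrete, here is a rough sketch of the bookkeeping they imply. It is not the actual JobTracker code: the class `JobBlacklist`, its method names, and the `cluster_size` parameter are all made up for illustration, and the constants are simply the '5 failures' and 'say 25%' figures from above.

```python
from collections import defaultdict

FAILURES_TO_BLACKLIST = 5      # the '5 failures == blacklisted' quota
MAX_BLACKLIST_FRACTION = 0.25  # the 'say 25%' cap from part (b)


class JobBlacklist:
    """Hypothetical per-job failure bookkeeping (illustration only)."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.failures = defaultdict(int)  # tracker name -> failure count
        self.blacklisted = set()

    def record_failure(self, tracker):
        """Count one failure; return True if the tracker is blacklisted."""
        self.failures[tracker] += 1
        if tracker in self.blacklisted:
            return True
        if self.failures[tracker] < FAILURES_TO_BLACKLIST:
            return False
        # Part (b): stop growing the blacklist once the cap is reached.
        if len(self.blacklisted) >= MAX_BLACKLIST_FRACTION * self.cluster_size:
            return False
        self.blacklisted.add(tracker)
        return True

    def on_tracker_lost(self, tracker, tasks_run_on_tracker):
        """Part (a): a lost tracker costs one failure, however many tasks ran."""
        del tasks_run_on_tracker  # deliberately ignored
        return self.record_failure(tracker)
```

With this, a tracker that is lost once after running 40 tasks holds a count of 1 rather than 40, so it takes five separate incidents to blacklist it. The sketch only stops adding nodes once the cap is hit; whether the job should then be failed outright is left to the caller.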
Thoughts?