+1 for the synchronous resolution
Here is a patch that tries to do the resolution inline. Testing in progress.
Modified the test case to reflect the changes.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386527/HADOOP-3780-v1.2.patch against trunk revision 678196.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/testReport/
This message is automatically generated.
I just committed this. Thanks, Amar!
Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/)
This seems to make the member variable numResolved unimportant and the method moot. The numResolved count is now always zero, so getNumResolved() == 0, which breaks any tests that used it to wait for the cluster to come up.
1. How can I count the number of task trackers under a job tracker? Patching TaskTracker.getNumResolvedTaskTrackers() to return taskTrackers.size() appears to work; there's no need to make this synchronized.
public int getNumResolvedTaskTrackers() { return taskTrackers.size(); }
Is this the right thing to do? Should the method name stay the same? Should this be fixed in 0.18.3 too?
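To make the proposal concrete, here is a minimal sketch of a tracker count backed by the live tracker map, plus the kind of wait loop a test might use. The class TrackerRegistrySketch, the addTracker and waitForTrackers helpers, and the choice of ConcurrentHashMap are all assumptions made for illustration; none of this is the actual Hadoop code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the tracker bookkeeping; not the real Hadoop class. */
class TrackerRegistrySketch {
    // Assumption: a concurrent map, so size() can be called without extra locking.
    private final Map<String, String> taskTrackers = new ConcurrentHashMap<>();

    /** Registers a tracker; with inline resolution it counts as resolved on arrival. */
    void addTracker(String trackerName, String host) {
        taskTrackers.put(trackerName, host);
    }

    /** Same method name as before, now answered from the tracker map itself. */
    int getNumResolvedTaskTrackers() {
        return taskTrackers.size();
    }

    /** Polls until at least 'expected' trackers are known or the timeout expires. */
    boolean waitForTrackers(int expected, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (getNumResolvedTaskTrackers() < expected) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // gave up: the cluster did not come up in time
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A test would call waitForTrackers(n, timeout) before submitting jobs. Whether the unsynchronized read is safe in the real code depends on what kind of map taskTrackers actually is.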
+1 Attached the patch for branch 18.
[exec] -1 overall.
The javadoc warning is not related to this patch. Unit tests also passed on my machine in branch 18.
I committed this patch to the 0.18 branch. Thanks Ravi!
HADOOP-3245 is as follows:
Summary:
In HADOOP-3245 we are adding a new operation called the SYNC operation. This directs the task tracker to upload its local state to the jobtracker. The whole design expects the SYNC operation to complete in one go. Partial updates can leave the JobTracker in an inconsistent state and might cause the job to get stuck. As of now, the only thing that can cause the SYNC operation to fail is an update from an unresolved tracker. Under such conditions the JT is partially updated, which breaks HADOOP-3245.
Info:
Rules:
Description:
0) JT restarts and hence HBE for all TTs will be false.
1) TT connects to the restarted JT with IC = false.
2) JT sends a SYNC operation to the TT.
3) TT uploads the task statuses with IC = true.
4) JT (as a part of heartbeat) tries to update the task states/status.
5) If (4) is successful: JT sets HBE = true for this TT.
6) If (4) fails: the JT has made some changes in the task states but HBE = false. Consider task t being marked as SUCCEEDED before the SYNC fails.
7) TT comes back with IC = false.
8) IC == false && HBE == false && JTR == true, so JT sends a SYNC again.
9) (3) happens again.
10) (4) happens again. Since IC == true and SB == true, JT considers this TT as lost.
11) This causes the task t to be marked as KILLED.
12) In the same method the status updates are applied and hence t will be marked as SUCCEEDED.
13) Now we have task completion events with the same task marked as both KILLED and SUCCEEDED.
14) Since task t is marked as SUCCEEDED later, the JT assumes that the TIP is completed while the reducers keep on ignoring the task t's output.
15) Job is stuck.
This problem will not occur if (4) succeeds without any problem, i.e. every SYNC should make HBE = true. (4) can only fail if the tracker is not resolved. Hence inline resolution solves the problem.
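The sequence above can be replayed in a toy model. The sketch below is only an illustration under stated assumptions: SyncSketch and all its fields are hypothetical names, the "tracker looks lost" check of step 10 is modeled simply as "the JT already holds task state for this tracker", and a deferred resolution is assumed to finish before the next SYNC arrives. It is not the JobTracker implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of SYNC handling on the JT side; all names are hypothetical. */
class SyncSketch {
    private final boolean inlineResolution;
    private final Set<String> resolved = new HashSet<>();
    final Set<String> heartbeatEstablished = new HashSet<>();   // trackers with HBE == true
    private final Map<String, Set<String>> tasksByTracker = new HashMap<>();
    final List<String> completionEvents = new ArrayList<>();

    SyncSketch(boolean inlineResolution) {
        this.inlineResolution = inlineResolution;
    }

    /** Handles one SYNC upload (IC = true). Returns true if the update fully succeeded. */
    boolean sync(String tracker, List<String> succeededTasks) {
        if (inlineResolution) {
            resolved.add(tracker); // resolve before any JT state is touched
        }
        Set<String> known = tasksByTracker.computeIfAbsent(tracker, k -> new LinkedHashSet<>());
        // Steps 10-11: leftover state from an earlier partial SYNC makes the
        // tracker look lost, so its known tasks are marked KILLED.
        for (String task : known) {
            completionEvents.add(task + ":KILLED");
        }
        known.clear();
        // Steps 4 and 12: apply the uploaded statuses.
        for (String task : succeededTasks) {
            known.add(task);
            completionEvents.add(task + ":SUCCEEDED");
        }
        if (!resolved.contains(tracker)) {
            // Step 6: failure after state was already changed; HBE stays false.
            // Assumption: the deferred resolution completes before the next SYNC.
            resolved.add(tracker);
            return false;
        }
        heartbeatEstablished.add(tracker); // step 5: HBE = true
        return true;
    }
}
```

With inlineResolution set to false, two SYNC calls for task t leave completionEvents as t:SUCCEEDED, t:KILLED, t:SUCCEEDED, which is the conflicting pair from step 13. With it set to true, the first SYNC succeeds and only t:SUCCEEDED is recorded.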