
Key: HADOOP-3780
Type: Improvement
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Amar Kamat
Reporter: Amar Kamat
Votes: 0
Watchers: 1
Hadoop Common

JobTracker should synchronously resolve the tasktracker's network location when the tracker registers

Created: 17/Jul/08 10:03 AM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.18.3, 0.19.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
File                   | Uploaded            | Uploader     | Size
HADOOP-3780-v1.1.patch | 2008-07-21 05:24 AM | Amar Kamat   | 4 kB
HADOOP-3780-v1.2.patch | 2008-07-21 02:36 PM | Amar Kamat   | 5 kB
HADOOP-3780-v1.patch   | 2008-07-17 01:25 PM | Amar Kamat   | 4 kB
HADOOP18-3780.patch    | 2008-12-26 04:40 AM | Ravi Gummadi | 6 kB
Issue Links:
Blocker
 
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 12/Aug/08 09:46 PM


Description
This issue is inspired by HADOOP-3620. In the JobTracker, a tracker's network location is currently resolved asynchronously. It can instead be resolved inline, i.e. while the tracker registers. This greatly helps HADOOP-3245, whose design becomes simpler with this enhancement.

Amar Kamat added a comment - 17/Jul/08 12:12 PM - edited
The reason why this issue is important for HADOOP-3245 is as follows:

Summary:
In HADOOP-3245 we are adding a new operation called SYNC, which directs the task tracker to upload its local state to the JobTracker. The whole design expects the SYNC operation to complete in one go. A partial update can leave the JobTracker in an inconsistent state and might cause the job to get stuck. As of now, the only thing that can cause the SYNC operation to fail is an update from an unresolved tracker. In that case the JT is only partially updated, which breaks HADOOP-3245.

Info:

SYM | Stands for      | Description                                                       | Used for
IC  | Initial contact | Whether the TT is connected to the JT or not (TT's point of view) | Re-init/Sync the TT
SB  | Seen before     | Whether there are some previous status entries                    | Mark a TT as lost
HBE | Heartbeat entry | Whether the TT is connected/registered (JT's point of view)       | Re-init/Sync the TT
JTR | JT restarted    | Whether the JT has restarted                                      | Re-init/Sync the TT

Rules:

IC    | HBE   | SB    | JTR   | Action
false | false | -     | true  | SYNC
false | false | -     | false | Re-init
false | true  | -     | -     | Re-send prev response
true  | -     | true  | -     | Mark lost (kill tasks)
false | -     | false | -     | Make SB false, i.e. clear previous status entries
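As a reading aid, the rules table can be encoded as data. This is only a minimal sketch: the class, enum, and method names are hypothetical and are not taken from the JobTracker code. A "-" entry is modelled as null (don't care). Because the table does not state a precedence between rows, the sketch returns every row that matches rather than picking one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative encoding of the rules table; hypothetical, not JobTracker code. */
final class HeartbeatRules {
    enum Action { SYNC, REINIT, RESEND_PREV_RESPONSE, MARK_LOST, CLEAR_PREV_STATUS }

    /** One table row; a null condition stands for "-" (don't care). */
    static final class Rule {
        final Boolean ic, hbe, sb, jtr;
        final Action action;

        Rule(Boolean ic, Boolean hbe, Boolean sb, Boolean jtr, Action action) {
            this.ic = ic;
            this.hbe = hbe;
            this.sb = sb;
            this.jtr = jtr;
            this.action = action;
        }

        boolean matches(boolean ic, boolean hbe, boolean sb, boolean jtr) {
            return agrees(this.ic, ic) && agrees(this.hbe, hbe)
                && agrees(this.sb, sb) && agrees(this.jtr, jtr);
        }

        private static boolean agrees(Boolean expected, boolean actual) {
            return expected == null || expected.booleanValue() == actual;
        }
    }

    // Rows in the same order as the table above.
    static final List<Rule> RULES = Arrays.asList(
        new Rule(false, false, null,  true,  Action.SYNC),
        new Rule(false, false, null,  false, Action.REINIT),
        new Rule(false, true,  null,  null,  Action.RESEND_PREV_RESPONSE),
        new Rule(true,  null,  true,  null,  Action.MARK_LOST),
        new Rule(false, null,  false, null,  Action.CLEAR_PREV_STATUS));

    /** Returns the action of every row that matches, in table order. */
    static List<Action> matching(boolean ic, boolean hbe, boolean sb, boolean jtr) {
        List<Action> out = new ArrayList<>();
        for (Rule r : RULES) {
            if (r.matches(ic, hbe, sb, jtr)) {
                out.add(r.action);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A known TT reconnecting to a restarted JT: IC=false, HBE=false, SB=true, JTR=true.
        System.out.println(matching(false, false, true, true)); // prints [SYNC]
    }
}
```

Note that a tracker with IC=false, HBE=false and SB=false matches two rows (the SYNC or Re-init row, plus the last row), which is why the sketch returns a list.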

Description:

0) JT restarts, so HBE is false for all TTs.
1) TT connects to the restarted JT with IC = false.
2) JT sends a SYNC operation to the TT.
3) TT uploads the task statuses with IC = true.
4) JT (as a part of heartbeat) tries to update the task states/statuses.
5) If (4) succeeds: JT sets HBE = true for this TT.
6) If (4) fails: the JT has made some changes to the task states but HBE = false.
     Consider a task t that was marked as SUCCEEDED before the SYNC failed.
7) TT comes back with IC = false.
8) IC == false && HBE == false && JTR == true, so JT sends a SYNC again.
9) (3) happens again.
10) (4) happens again. Since IC == true and SB == true, JT considers this TT lost.
11) This causes the task t to be marked as KILLED.
12) In the same method the status updates are applied, and hence t is marked as SUCCEEDED.
13) Now we have task completion events with the same task marked as both KILLED and SUCCEEDED.
14) Since task t is marked as SUCCEEDED later, the JT assumes that the TIP is completed, while the reducers keep ignoring task t's output.
15) Job is stuck.

This problem will not occur if (4) always succeeds, i.e. every SYNC makes HBE = true. (4) can only fail if the tracker is not resolved. Hence inline resolution solves the problem.
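The sequence in steps 0-15 can be replayed with a toy model. Everything here is an assumption made for illustration: the class, its fields, and the trackerResolved flag are hypothetical and do not correspond to real JobTracker members. The sketch only shows how a partially applied SYNC followed by a second SYNC produces both a KILLED and a SUCCEEDED event for the same task, and how a tracker that is already resolved avoids this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of steps 0-15; all names are hypothetical, not JobTracker code. */
final class SyncReplaySketch {
    final Map<String, String> taskStates = new LinkedHashMap<>();
    final List<String> completionEvents = new ArrayList<>();
    boolean hbe = false; // heartbeat entry: the JT considers the TT registered
    boolean sb = false;  // seen before: the JT holds earlier status entries for the TT

    /**
     * Handles the status upload that answers a SYNC (sent by the TT with IC = true).
     * Returns true if the update completed and the TT is now registered.
     */
    boolean applySync(Map<String, String> reported, boolean trackerResolved) {
        if (sb) {
            // IC == true and SB == true: the TT is treated as lost and its tasks are killed.
            for (Map.Entry<String, String> e : taskStates.entrySet()) {
                e.setValue("KILLED");
                completionEvents.add(e.getKey() + ":KILLED");
            }
        }
        sb = true;
        // The reported statuses are applied in the same call.
        for (Map.Entry<String, String> e : reported.entrySet()) {
            taskStates.put(e.getKey(), e.getValue());
            completionEvents.add(e.getKey() + ":" + e.getValue());
        }
        if (!trackerResolved) {
            return false; // partial update: task states changed, HBE stays false
        }
        hbe = true;
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> upload = new LinkedHashMap<>();
        upload.put("t", "SUCCEEDED");

        // Asynchronous resolution: the first SYNC arrives before the tracker is resolved.
        SyncReplaySketch async = new SyncReplaySketch();
        async.applySync(upload, false);
        async.applySync(upload, true);
        System.out.println(async.completionEvents); // [t:SUCCEEDED, t:KILLED, t:SUCCEEDED]

        // Inline resolution: the tracker is resolved by the time the SYNC is processed.
        SyncReplaySketch inline = new SyncReplaySketch();
        inline.applySync(upload, true);
        System.out.println(inline.completionEvents); // [t:SUCCEEDED]
    }
}
```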


Devaraj Das added a comment - 17/Jul/08 01:04 PM
+1 for the synchronous resolution

Amar Kamat added a comment - 17/Jul/08 01:25 PM
Here is a patch that tries to do the resolution inline. Testing in progress.

Amar Kamat added a comment - 21/Jul/08 05:22 AM
Updated the patch to trunk.

Amar Kamat added a comment - 21/Jul/08 02:36 PM
Modified the test case to reflect the changes.

Amar Kamat added a comment - 21/Jul/08 02:37 PM
Updated ....

Hadoop QA added a comment - 21/Jul/08 06:19 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12386527/HADOOP-3780-v1.2.patch
against trunk revision 678196.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/console

This message is automatically generated.


Owen O'Malley added a comment - 12/Aug/08 09:46 PM
I just committed this. Thanks, Amar!

Hudson added a comment - 22/Aug/08 12:34 PM

Steve Loughran added a comment - 22/Aug/08 01:42 PM
This seems to render the member variable numResolved unimportant, and its getter moot. The numResolved count is now always zero, so getNumResolved() == 0, which breaks any tests that used it to wait for the cluster to come up.

1. How can I count the number of task trackers under a job tracker?
2. Can this number be passed to getNumResolved() for backwards compatibility, or can that method be deleted?
3. numResolved should be deleted; anyone who is using it needs to know their code has broken.


Steve Loughran added a comment - 22/Aug/08 01:54 PM
Patching TaskTracker.getNumResolvedTaskTrackers() to return taskTrackers.size() appears to work; there's no need to make this synchronized.

public int getNumResolvedTaskTrackers() { return taskTrackers.size(); }

Is this the right thing to do? Should the method name stay the same?
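For tests that wait for the cluster to come up, the wait itself can be kept independent of which counter is used. The helper below is a hypothetical sketch, not part of Hadoop: ClusterWait and waitForTrackers are made-up names, and the IntSupplier stands in for whatever count the test harness can read (for example the getNumResolvedTaskTrackers() variant proposed above).

```java
import java.util.function.IntSupplier;

/** Hypothetical test helper; not a Hadoop API. */
final class ClusterWait {
    /**
     * Polls trackerCount until it reaches expected or the timeout elapses.
     * Returns true if the expected number of trackers was seen in time.
     */
    static boolean waitForTrackers(IntSupplier trackerCount, int expected,
                                   long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (trackerCount.getAsInt() < expected) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A test would pass a supplier that reads the live tracker count, so a counter that is stuck at zero shows up as a timeout rather than a hang.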


Nigel Daley added a comment - 02/Dec/08 07:53 PM
Should this be fixed in 0.18.3 too?

Amareshwari Sriramadasu added a comment - 03/Dec/08 03:31 AM

Should this be fixed in 0.18.3 too?

+1


Ravi Gummadi added a comment - 26/Dec/08 04:40 AM
Attached the patch for branch 18.

[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

The javadoc warning is not related to this patch.

Unit tests also passed on my machine in branch 18.


Amar Kamat added a comment - 30/Dec/08 10:14 AM
+1.

Devaraj Das added a comment - 30/Dec/08 12:00 PM
I committed this patch to the 0.18 branch. Thanks Ravi!