Hadoop Common / HADOOP-3780

JobTracker should synchronously resolve the tasktracker's network location when the tracker registers

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.3, 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This issue is inspired by HADOOP-3620. Currently the JobTracker resolves a tracker's network location asynchronously. The resolution can instead be done inline, i.e. while the tracker registers. This is of great help for HADOOP-3245, where the change makes the design simpler.
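
A minimal sketch of what inline resolution means, assuming a resolver function that maps a host name to a network location. This is not the committed patch: the class `InlineResolutionSketch`, its methods, and the `/default-rack` value are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for the JobTracker's tracker bookkeeping; not Hadoop code. */
class InlineResolutionSketch {
    /** Assumed resolver: maps a tracker host to a network location such as "/rack1". */
    private final Function<String, String> resolver;
    /** Host -> resolved network location for every registered tracker. */
    private final Map<String, String> locations = new ConcurrentHashMap<>();

    InlineResolutionSketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Registers a tracker and resolves its location before returning (no background queue). */
    String register(String host) {
        return locations.computeIfAbsent(host, resolver);
    }

    /** True once a tracker has registered, because registration and resolution are one step. */
    boolean isResolved(String host) {
        return locations.containsKey(host);
    }

    int numTrackers() {
        return locations.size();
    }

    public static void main(String[] args) {
        InlineResolutionSketch jt = new InlineResolutionSketch(host -> "/default-rack");
        jt.register("tt1.example.com");
        System.out.println(jt.isResolved("tt1.example.com"));
        System.out.println(jt.numTrackers());
    }
}
```

In this model every registered tracker is also a resolved tracker, so later status updates never see an unresolved tracker.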

      1. HADOOP18-3780.patch
        6 kB
        Ravi Gummadi
      2. HADOOP-3780-v1.2.patch
        5 kB
        Amar Kamat
      3. HADOOP-3780-v1.1.patch
        4 kB
        Amar Kamat
      4. HADOOP-3780-v1.patch
        4 kB
        Amar Kamat

          Activity

          Devaraj Das added a comment -

          I committed this patch to the 0.18 branch. Thanks Ravi!

          Amar Kamat added a comment -

          +1.

          Ravi Gummadi added a comment -

          Attached the patch for branch 18.

          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          The javadoc warning is not related to this patch.

          Unit tests also passed on my machine in branch 18.

          Amareshwari Sriramadasu added a comment -

          "Should this be fixed in 0.18.3 too?"

          +1

          Nigel Daley added a comment -

          Should this be fixed in 0.18.3 too?

          steve_l added a comment -

          Patching TaskTracker.getNumResolvedTaskTrackers() to return taskTrackers.size() appears to work; there's no need to make this synchronized.

          public int getNumResolvedTaskTrackers() {
            return taskTrackers.size();
          }

          Is this the right thing to do? Should the method name stay the same?

          steve_l added a comment -

          This seems to render the member variable numResolved unimportant, and the method moot. The numResolved count is now always zero, so getNumResolved() == 0, which breaks any tests that used it to wait for the cluster to come up.

          1. How can I count the number of task trackers under a job tracker?
          2. Can this number be returned by getNumResolved() for backwards compatibility (BC), or can that method be deleted?
          3. numResolved should be deleted; anyone who is using it needs to know their code has broken.

          Hudson added a comment -

          Integrated in Hadoop-trunk #581 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/ )

          Owen O'Malley added a comment -

          I just committed this. Thanks, Amar!

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12386527/HADOOP-3780-v1.2.patch
          against trunk revision 678196.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2915/console

          This message is automatically generated.

          Amar Kamat added a comment -

          Updated ....

          Amar Kamat added a comment -

          Modified the test case to reflect the changes.

          Amar Kamat added a comment -

          Updated the patch to trunk.

          Amar Kamat added a comment -

          Here is a patch that tries to do the resolution inline. Testing in progress.

          Devaraj Das added a comment -

          +1 for the synchronous resolution

          Amar Kamat added a comment - - edited

          This issue is important for HADOOP-3245 for the following reason.

          Summary :
          In HADOOP-3245 we are adding a new operation called SYNC. It directs the task tracker to upload its local state to the JobTracker. The whole design expects the SYNC operation to complete in one go. A partial update can leave the JobTracker in an inconsistent state and might cause the job to get stuck. As of now, the only thing that can cause the SYNC operation to fail is an update from an unresolved tracker. In that case the JT is only partially updated, which breaks HADOOP-3245.

          Info:

          SYM | Stands for      | Description                                                      | Used for
          IC  | Initial contact | Whether the TT is connected to the JT or not, TT's point of view | Re-init/Sync the TT
          SB  | Seen before     | Whether there are some previous status entries                   | Mark a TT as lost
          HBE | Heartbeat entry | Whether the TT is connected/registered, JT's point of view       | Re-init/Sync the TT
          JTR | JT restarted    | Whether the JT has restarted                                     | Re-init/Sync the TT

          Rules :

          IC    | HBE   | SB    | JTR   | Action
          false | false | -     | true  | SYNC
          false | false | -     | false | Re-init
          false | true  | -     | -     | Re-send prev response
          true  | -     | true  | -     | Mark lost (kill tasks)
          false | -     | false | -     | Make SB false, i.e. clear previous status entries
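
The rules can be read as a lookup in which "-" means "don't care". The sketch below encodes each row as data and returns every action whose row matches, in table order. It is an illustration under assumptions, not JobTracker code: the class `HeartbeatRules`, the `Action` names, and the choice to return all matching rows (rather than pick one) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Encodes the rules table as data; a null condition stands for a "-" (don't care) cell. */
class HeartbeatRules {
    enum Action { SYNC, REINIT, RESEND_PREVIOUS_RESPONSE, MARK_LOST, CLEAR_PREVIOUS_STATUS }

    /** One table row: required IC, HBE, SB, JTR values (null = any) and the resulting action. */
    static final class Rule {
        final Boolean ic, hbe, sb, jtr;
        final Action action;

        Rule(Boolean ic, Boolean hbe, Boolean sb, Boolean jtr, Action action) {
            this.ic = ic; this.hbe = hbe; this.sb = sb; this.jtr = jtr; this.action = action;
        }

        boolean matches(boolean ic, boolean hbe, boolean sb, boolean jtr) {
            return fits(this.ic, ic) && fits(this.hbe, hbe)
                && fits(this.sb, sb) && fits(this.jtr, jtr);
        }

        private static boolean fits(Boolean wanted, boolean actual) {
            return wanted == null || wanted.booleanValue() == actual;
        }
    }

    /** Rows in the same order as the table. */
    static final List<Rule> RULES = Arrays.asList(
        new Rule(false, false, null,  true,  Action.SYNC),
        new Rule(false, false, null,  false, Action.REINIT),
        new Rule(false, true,  null,  null,  Action.RESEND_PREVIOUS_RESPONSE),
        new Rule(true,  null,  true,  null,  Action.MARK_LOST),
        new Rule(false, null,  false, null,  Action.CLEAR_PREVIOUS_STATUS));

    /** Returns the actions of all rows that match the given flags, in table order. */
    static List<Action> actionsFor(boolean ic, boolean hbe, boolean sb, boolean jtr) {
        List<Action> result = new ArrayList<>();
        for (Rule rule : RULES) {
            if (rule.matches(ic, hbe, sb, jtr)) {
                result.add(rule.action);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A tracker contacting a restarted JT with IC = false and no heartbeat entry.
        System.out.println(actionsFor(false, false, true, true));
    }
}
```

Because the rows are not mutually exclusive (the last row overlaps the first three when SB is false), the sketch reports all matches instead of assuming a priority order that the table does not state.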

          Description :

          0) JT restarts and hence HBE for all TTs will be false.
          1) TT connects to the restarted JT with IC = false.
          2) JT sends a SYNC operation to the TT.
          3) TT uploads the task statuses with IC = true.
          4) JT (as a part of heartbeat) tries to update the task states/statuses.
          5) If (4) is successful : JT sets HBE = true for this TT.
          6) If (4) fails : the JT has made some changes in the task states but HBE = false.
               Consider task t being marked as SUCCEEDED before the SYNC fails.
          7) TT comes back with IC = false.
          8) IC == false && HBE == false && JTR == true, so the JT sends a SYNC again.
          9) (3) happens again.
          10) (4) happens again. Since IC == true and SB == true, the JT considers this TT as lost.
          11) This causes the task t to be marked as KILLED.
          12) In the same method the status updates are applied and hence t will be marked as SUCCEEDED.
          13) Now we have task completion events with the same task marked as both KILLED and SUCCEEDED.
          14) Since task t is marked as SUCCEEDED later, the JT assumes that the TIP is completed while the reducers keep on ignoring task t's output.
          15) Job is stuck.

          This problem will not occur if (4) always succeeds, i.e. every SYNC should make HBE = true. (4) can only fail if the tracker is not resolved. Hence inline resolution solves the problem.
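
The sequence above can be replayed with a toy model. This is a sketch under simplifying assumptions, not the JobTracker's logic: the class `SyncReplaySketch`, its fields, and the event strings are hypothetical, and a single task t stands in for the whole uploaded state.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the steps above; every name here is hypothetical and not Hadoop code. */
class SyncReplaySketch {
    boolean heartbeatEntry = false;                  // HBE, JT's view of the tracker
    boolean seenBefore = false;                      // SB, previous status entries exist
    final List<String> events = new ArrayList<>();   // task completion events, in order

    /**
     * Handles a SYNC upload (IC = true) reporting that a task succeeded.
     * Returns false if the update aborts because the tracker is unresolved.
     */
    boolean handleSyncUpload(String task, boolean trackerResolved) {
        if (seenBefore) {
            events.add(task + " KILLED");            // steps 10-11: IC && SB, tracker treated as lost
        }
        seenBefore = true;
        events.add(task + " SUCCEEDED");             // steps 4 and 12: status update applied
        if (!trackerResolved) {
            return false;                            // step 6: partial update, HBE stays false
        }
        heartbeatEntry = true;                       // step 5: SYNC completed in one go
        return true;
    }

    public static void main(String[] args) {
        // Asynchronous resolution: the first upload arrives before the tracker is resolved.
        SyncReplaySketch async = new SyncReplaySketch();
        async.handleSyncUpload("t", false);
        async.handleSyncUpload("t", true);
        System.out.println(async.events);

        // Inline resolution: the tracker is always resolved by the time it uploads.
        SyncReplaySketch inline = new SyncReplaySketch();
        inline.handleSyncUpload("t", true);
        System.out.println(inline.events);
    }
}
```

In the asynchronous run the event list contains both a KILLED and a SUCCEEDED entry for t, which is the conflict described in step 13; in the inline run only the SUCCEEDED entry appears.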


            People

            • Assignee:
              Amar Kamat
              Reporter:
              Amar Kamat
