Hadoop Common
  1. Hadoop Common
  2. HADOOP-4638

Exception thrown in/from RecoveryManager.recover() should be caught and handled

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.19.2, 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      RecoveryManager.recover() can throw an exception while recovering a job. Since the JobTracker calls RecoveryManager.recover() from offerService(), any failure in recovery will cause JobTracker to crash. Ideally the RecoveryManager should log the failure encountered while recovering the job and continue.

      1. HADOOP-4638-v1.8.patch
        14 kB
        Amar Kamat
      2. HADOOP-4638-v1.8.5.patch
        12 kB
        Amar Kamat
      3. HADOOP-4638-v1.6.patch
        13 kB
        Amar Kamat
      4. HADOOP-4638-v1.3.patch
        13 kB
        Amar Kamat
      5. HADOOP-4638-v1.1.patch
        12 kB
        Amar Kamat

        Activity

        Hide
        Amar Kamat added a comment -

        Attaching a patch that prevents RecoveryManager from taking down the jobtracker. Added a testcase to test it.

        Show
        Amar Kamat added a comment - Attaching a patch that prevents RecoveryManager from taking down the jobtracker. Added a testcase to test it.
        Hide
        Amar Kamat added a comment -

        Attaching a new patch.

        Show
        Amar Kamat added a comment - Attaching a new patch.
        Hide
        Amar Kamat added a comment -

        Result of test-patch

        [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        
        
        Show
        Amar Kamat added a comment - Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Amar Kamat added a comment -

        Attaching a new patch. Result of test-patch

        [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        

        Ant test passes on my box.

        Show
        Amar Kamat added a comment - Attaching a new patch. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. Ant test passes on my box.
        Hide
        Amareshwari Sriramadasu added a comment -

        If the JobTracker fails to recover a job, it should do a job.kill() (essentially kill the job). This will kill all the tips and do a finalizeJob(). Then patch doesnt need the change to do with updateTaskStatuses.

        Show
        Amareshwari Sriramadasu added a comment - If the JobTracker fails to recover a job, it should do a job.kill() (essentially kill the job). This will kill all the tips and do a finalizeJob(). Then patch doesnt need the change to do with updateTaskStatuses.
        Hide
        Amar Kamat added a comment -

        Amareshwari,
        The job gets ignored before it gets added to jobtracker (i.e if the filename is not recoverable or restoration of master file fails). Once the filename is recovered, the recovery manager recovers whatever it can and continues. There is no killing done after that. Hence there is no need to do a job.kill().

        Show
        Amar Kamat added a comment - Amareshwari, The job gets ignored before it gets added to jobtracker (i.e if the filename is not recoverable or restoration of master file fails). Once the filename is recovered, the recovery manager recovers whatever it can and continues. There is no killing done after that. Hence there is no need to do a job.kill().
        Hide
        Amar Kamat added a comment -

        The change in updateTaskStatuses takes care of the case where the job which was running in the earlier jobtracker goes missing/undetected during recovery in the new jobtracker. Hence every job that is missing in the jobtracker should get removed from the tasktracker. Hence the change is required.

        Show
        Amar Kamat added a comment - The change in updateTaskStatuses takes care of the case where the job which was running in the earlier jobtracker goes missing/undetected during recovery in the new jobtracker. Hence every job that is missing in the jobtracker should get removed from the tasktracker. Hence the change is required.
        Hide
        Amareshwari Sriramadasu added a comment -

        Framework changes look good.
        Some comments in Testcase:
        1. testJobTracker() does not have any assertion in the test.
        2. Comments for testRecoveryManager() look different from implementation. Also can you add the assertion that Job1 is ignored and Job2 succeeded?

        Show
        Amareshwari Sriramadasu added a comment - Framework changes look good. Some comments in Testcase: 1. testJobTracker() does not have any assertion in the test. 2. Comments for testRecoveryManager() look different from implementation. Also can you add the assertion that Job1 is ignored and Job2 succeeded?
        Hide
        Amar Kamat added a comment -

        Incorporated Amareshwari's comments. Result of test-patch

        [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        
        Show
        Amar Kamat added a comment - Incorporated Amareshwari's comments. Result of test-patch [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Amareshwari Sriramadasu added a comment -

        +1 patch looks good.

        Show
        Amareshwari Sriramadasu added a comment - +1 patch looks good.
        Hide
        Hemanth Yamijala added a comment -

        I committed this to trunk and the 0.19 and 0.20 branches. Thanks, Amar !

        Show
        Hemanth Yamijala added a comment - I committed this to trunk and the 0.19 and 0.20 branches. Thanks, Amar !
        Hide
        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/ )

          People

          • Assignee:
            Amar Kamat
            Reporter:
            Amar Kamat
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development