Hadoop Common
  1. Hadoop Common
  2. HADOOP-5719

Jobs failed during job initalization are never removed from Capacity Schedulers waiting list

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Jobs which fail during initalization are never removed from Capacity Schedulers waiting job list.

      1. HADOOP-5719-5-20.patch
        7 kB
        Sreekanth Ramakrishnan
      2. HADOOP-5719-5.patch
        7 kB
        Sreekanth Ramakrishnan
      3. HADOOP-5719-4-20.patch
        4 kB
        Sreekanth Ramakrishnan
      4. HADOOP-5719-4.patch
        7 kB
        Sreekanth Ramakrishnan
      5. HADOOP-5719-3.patch
        6 kB
        Sreekanth Ramakrishnan
      6. HADOOP-5719-2.patch
        5 kB
        Sreekanth Ramakrishnan
      7. HADOOP-5719-1.patch
        3 kB
        Sreekanth Ramakrishnan

        Activity

        Hide
        Hudson added a comment -

        Integrated in Hadoop-trunk #828 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/828/)
        . Forgot to add a new file in the previous commit.
        . Remove jobs that failed initialization from the waiting queue in the capacity scheduler. Contributed by Sreekanth Ramakrishnan.

        Show
        Hudson added a comment - Integrated in Hadoop-trunk #828 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/828/ ) . Forgot to add a new file in the previous commit. . Remove jobs that failed initialization from the waiting queue in the capacity scheduler. Contributed by Sreekanth Ramakrishnan.
        Hide
        Hemanth Yamijala added a comment -

        I just committed this to trunk and branch 0.20. Thanks, Sreekanth !

        Show
        Hemanth Yamijala added a comment - I just committed this to trunk and branch 0.20. Thanks, Sreekanth !
        Hide
        Sreekanth Ramakrishnan added a comment -

        Output for ant test-patch for the latest attachment is as follows:

             [exec]
             [exec] +1 overall.
             [exec]
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec]
             [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
             [exec]
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec]
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec]
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec]
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec]
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
             [exec]
        
        Show
        Sreekanth Ramakrishnan added a comment - Output for ant test-patch for the latest attachment is as follows: [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
        Hide
        Hemanth Yamijala added a comment -

        Changes look fine to me. +1.

        Show
        Hemanth Yamijala added a comment - Changes look fine to me. +1.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching patch incorporating Hemanth's offline comments.

        In TestJobInitialization

        • Changed the scheduler property from guaranteed-capacity to capacity.
        • Asserting if the submitted job is failed.

        Attaching both 20 branch and trunk patch

        Show
        Sreekanth Ramakrishnan added a comment - Attaching patch incorporating Hemanth's offline comments. In TestJobInitialization Changed the scheduler property from guaranteed-capacity to capacity. Asserting if the submitted job is failed. Attaching both 20 branch and trunk patch
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching patch for branch 20.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching patch for branch 20.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Output from ant test-patch:

             [exec]
             [exec]
             [exec] +1 overall.
             [exec]
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec]
             [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
             [exec]
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec]
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec]
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec]
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec]
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
             [exec]
        
        Show
        Sreekanth Ramakrishnan added a comment - Output from ant test-patch: [exec] [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 6 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec]
        Hide
        Vinod Kumar Vavilapalli added a comment -

        +1 for the patch.

        Show
        Vinod Kumar Vavilapalli added a comment - +1 for the patch.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching patch incorporating Vinod's offline comments:

        • Adding apache license to new test case file.
        • Changed the name of new test case from TestJobInitializationPoller to TestJobInitialization.
        • Corrected Typo of initialization the TestCapacityScheduler.
        • Removed unnecessary whitespaces.
        • Removed unused variable from TestJobInitialization
        • Removed unused import from TestCapacityScheduler
        Show
        Sreekanth Ramakrishnan added a comment - Attaching patch incorporating Vinod's offline comments: Adding apache license to new test case file. Changed the name of new test case from TestJobInitializationPoller to TestJobInitialization. Corrected Typo of initialization the TestCapacityScheduler. Removed unnecessary whitespaces. Removed unused variable from TestJobInitialization Removed unused import from TestCapacityScheduler
        Hide
        Sreekanth Ramakrishnan added a comment -

        Removing an unused import statement from JobInitializationPoller

        Show
        Sreekanth Ramakrishnan added a comment - Removing an unused import statement from JobInitializationPoller
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching patch incorporating most of Vinod's comments.

        I think a better place for job removal from the JobQueuesManager is the cleanUpInitializedJobsList() method of teh JobInitializationPoller. We may want to rename this method and change its javadoc a bit.

        This has not been incorporated because of issue described in HADOOP-5020 it is hit when JobInProgress.initTasks() throws an exception and terminate job is called and Capacity scheduler would never be able to remove the job from the waiting queue.

        Also added a new test case TestJobInitalizationPoller which uses MiniMRCluster to verify if jobs failing initialization are actually removed from waiting queue.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching patch incorporating most of Vinod's comments. I think a better place for job removal from the JobQueuesManager is the cleanUpInitializedJobsList() method of teh JobInitializationPoller. We may want to rename this method and change its javadoc a bit. This has not been incorporated because of issue described in HADOOP-5020 it is hit when JobInProgress.initTasks() throws an exception and terminate job is called and Capacity scheduler would never be able to remove the job from the waiting queue. Also added a new test case TestJobInitalizationPoller which uses MiniMRCluster to verify if jobs failing initialization are actually removed from waiting queue.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Few comments:

        • I think a better place for job removal from the JobQueuesManager is the cleanUpInitializedJobsList() method of teh JobInitializationPoller. We may want to rename this method and change its javadoc a bit.
        • We don't need the null check before job.fail() in the initialization-poller
        • Test-case doesn't compile(perhaps because of HADOOP-5726). Need to change the signature of FakeQueueInfo
        • In the test-case, the statements depicting the error scenarios have to be reversed. For e.g. at TestCapacityScheduler.java +2045
               assertFalse("Waiting job does not contain submitted job",  mgr.getWaitingJobs("default").contains(job)); 

          should instead be

           assertFalse("Waiting job contains submitted job", mgr.getWaitingJobs("default").contains(job)); 
        Show
        Vinod Kumar Vavilapalli added a comment - Few comments: I think a better place for job removal from the JobQueuesManager is the cleanUpInitializedJobsList() method of teh JobInitializationPoller. We may want to rename this method and change its javadoc a bit. We don't need the null check before job.fail() in the initialization-poller Test-case doesn't compile(perhaps because of HADOOP-5726 ). Need to change the signature of FakeQueueInfo In the test-case, the statements depicting the error scenarios have to be reversed. For e.g. at TestCapacityScheduler.java +2045 assertFalse( "Waiting job does not contain submitted job" , mgr.getWaitingJobs( " default " ).contains(job)); should instead be assertFalse( "Waiting job contains submitted job" , mgr.getWaitingJobs( " default " ).contains(job));
        Hide
        Sreekanth Ramakrishnan added a comment -

        The result of ant test-patch is :

             [exec] 
             [exec] 
             [exec] 
             [exec] +1 overall.  
             [exec] 
             [exec]     +1 @author.  The patch does not contain any @author tags.
             [exec] 
             [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
             [exec] 
             [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
             [exec] 
             [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
             [exec] 
             [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
             [exec] 
             [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
             [exec] 
             [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
        
        Show
        Sreekanth Ramakrishnan added a comment - The result of ant test-patch is : [exec] [exec] [exec] [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Attaching a fix which addresses this issue. Alongwith test case which tests the condition.

        Show
        Sreekanth Ramakrishnan added a comment - Attaching a fix which addresses this issue. Alongwith test case which tests the condition.
        Hide
        Sreekanth Ramakrishnan added a comment -

        Currently job initialization is done in JobInitalizationPoller in the poller when an Exception is thrown while doing JobInProgress.initTasks() it does a JobInProgress.fail() but the fail does not inform all the job in progress listeners, resulting in job not being removed from the waiting job list of scheduler.

        Show
        Sreekanth Ramakrishnan added a comment - Currently job initialization is done in JobInitalizationPoller in the poller when an Exception is thrown while doing JobInProgress.initTasks() it does a JobInProgress.fail() but the fail does not inform all the job in progress listeners, resulting in job not being removed from the waiting job list of scheduler.

          People

          • Assignee:
            Sreekanth Ramakrishnan
            Reporter:
            Sreekanth Ramakrishnan
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development