
Key: HADOOP-5719
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Sreekanth Ramakrishnan
Reporter: Sreekanth Ramakrishnan
Votes: 0
Watchers: 4

Hadoop Common

Jobs that fail during job initialization are never removed from the Capacity Scheduler's waiting list

Created: 22/Apr/09 07:39 AM   Updated: 08/Jul/09 04:40 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.20.1

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works HADOOP-5719-1.patch 2009-04-22 08:28 AM Sreekanth Ramakrishnan 3 kB
Text File Licensed for inclusion in ASF works HADOOP-5719-2.patch 2009-05-05 11:59 AM Sreekanth Ramakrishnan 5 kB
Text File Licensed for inclusion in ASF works HADOOP-5719-3.patch 2009-05-05 12:06 PM Sreekanth Ramakrishnan 6 kB
Text File Licensed for inclusion in ASF works HADOOP-5719-4-20.patch 2009-05-05 01:42 PM Sreekanth Ramakrishnan 4 kB
Text File Licensed for inclusion in ASF works HADOOP-5719-4.patch 2009-05-05 01:12 PM Sreekanth Ramakrishnan 7 kB
Text File Licensed for inclusion in ASF works HADOOP-5719-5-20.patch 2009-05-05 05:09 PM Sreekanth Ramakrishnan 7 kB
Text File Licensed for inclusion in ASF works HADOOP-5719-5.patch 2009-05-05 05:09 PM Sreekanth Ramakrishnan 7 kB

Hadoop Flags: Reviewed
Resolution Date: 06/May/09 05:46 AM


 Description
Jobs that fail during initialization are never removed from the Capacity Scheduler's waiting job list.

 Comments
Sreekanth Ramakrishnan added a comment - 22/Apr/09 07:43 AM
Currently, job initialization is done in the JobInitializationPoller. When an exception is thrown during JobInProgress.initTasks(), the poller calls JobInProgress.fail(), but fail() does not inform the job-in-progress listeners. As a result, the job is never removed from the scheduler's waiting job list.
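
The failure mode described in this comment can be illustrated with a small, self-contained sketch. It does not use the real Hadoop classes and is not the committed patch: Job, JobListener, Scheduler, failSilently() and failAndNotify() are hypothetical names chosen for illustration. It only shows why a fail path that skips listener notification leaves the job in the waiting list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins; these are NOT the real Hadoop classes.
class WaitingListSketch {

  /** Listener notified when a job fails (assumed interface). */
  interface JobListener {
    void jobFailed(Job job);
  }

  static class Job {
    final String id;
    private final List<JobListener> listeners = new ArrayList<>();
    boolean failed = false;

    Job(String id) {
      this.id = id;
    }

    void addListener(JobListener listener) {
      listeners.add(listener);
    }

    /** Mirrors the reported bug: the job is marked failed, but no listener hears about it. */
    void failSilently() {
      failed = true;
    }

    /** One possible remedy: mark the job failed and notify every listener. */
    void failAndNotify() {
      failed = true;
      for (JobListener listener : listeners) {
        listener.jobFailed(this);
      }
    }
  }

  /** Keeps a waiting list and removes a job only when told that it failed. */
  static class Scheduler implements JobListener {
    final List<Job> waitingJobs = new ArrayList<>();

    void submit(Job job) {
      waitingJobs.add(job);
      job.addListener(this);
    }

    @Override
    public void jobFailed(Job job) {
      waitingJobs.remove(job);
    }
  }

  public static void main(String[] args) {
    Scheduler scheduler = new Scheduler();
    Job a = new Job("job_a");
    Job b = new Job("job_b");
    scheduler.submit(a);
    scheduler.submit(b);

    a.failSilently();   // stays in the waiting list
    b.failAndNotify();  // removed from the waiting list

    System.out.println("job_a still waiting: " + scheduler.waitingJobs.contains(a));
    System.out.println("job_b still waiting: " + scheduler.waitingJobs.contains(b));
  }
}
```

In this sketch the silently failed job remains in waitingJobs indefinitely, which is the symptom reported here.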

Sreekanth Ramakrishnan added a comment - 22/Apr/09 08:28 AM
Attaching a fix that addresses this issue, along with a test case that exercises the condition.

Sreekanth Ramakrishnan added a comment - 05/May/09 08:40 AM
The result of ant test-patch is :
     [exec] 
     [exec] 
     [exec] 
     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Vinod K V added a comment - 05/May/09 09:58 AM
A few comments:
  • I think a better place for job removal from the JobQueuesManager is the cleanUpInitializedJobsList() method of the JobInitializationPoller. We may want to rename this method and change its javadoc a bit.
  • We don't need the null check before job.fail() in the initialization poller.
  • The test case doesn't compile (perhaps because of HADOOP-5726). The signature of FakeQueueInfo needs to change.
  • In the test-case, the statements depicting the error scenarios have to be reversed. For e.g. at TestCapacityScheduler.java +2045
    assertFalse("Waiting job does not contain submitted job",  mgr.getWaitingJobs("default").contains(job));
    should instead be
    assertFalse("Waiting job contains submitted job", mgr.getWaitingJobs("default").contains(job));
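
The point about assertion messages can be shown with a minimal sketch. The message passed to assertFalse is only reported when the assertion fails, so it should describe the failing state. The helper below is a hand-written stand-in with the same shape as JUnit's assertFalse(String, boolean), and the waiting list is a plain list of strings; both are assumptions made so the example runs without JUnit or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; AssertionMessageSketch and failureMessage are hypothetical names.
class AssertionMessageSketch {

  /** Minimal stand-in with the same shape as JUnit's assertFalse(String, boolean). */
  static void assertFalse(String message, boolean condition) {
    if (condition) {
      throw new AssertionError(message);
    }
  }

  /** Runs the check and returns the reported failure message, or null if the check passed. */
  static String failureMessage(String message, List<String> waitingJobs, String job) {
    try {
      assertFalse(message, waitingJobs.contains(job));
      return null;
    } catch (AssertionError e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    List<String> waitingJobs = new ArrayList<>();
    waitingJobs.add("job_1"); // simulate the bug: the failed job is still waiting

    // The message is shown only on failure, so it names the erroneous state.
    System.out.println(
        failureMessage("Waiting job contains submitted job", waitingJobs, "job_1"));
  }
}
```

With the original wording, a failing run would report "Waiting job does not contain submitted job", which is the opposite of what actually went wrong.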

Sreekanth Ramakrishnan added a comment - 05/May/09 11:59 AM
Attaching patch incorporating most of Vinod's comments.

"I think a better place for job removal from the JobQueuesManager is the cleanUpInitializedJobsList() method of the JobInitializationPoller. We may want to rename this method and change its javadoc a bit."

This has not been incorporated because of the issue described in HADOOP-5020. That issue is hit when JobInProgress.initTasks() throws an exception and terminate job is called, in which case the Capacity Scheduler would never be able to remove the job from the waiting queue.

Also added a new test case, TestJobInitializationPoller, which uses MiniMRCluster to verify that jobs failing initialization are actually removed from the waiting queue.


Sreekanth Ramakrishnan added a comment - 05/May/09 12:06 PM
Removing an unused import statement from JobInitializationPoller

Sreekanth Ramakrishnan added a comment - 05/May/09 01:12 PM
Attaching patch incorporating Vinod's offline comments:
  • Adding apache license to new test case file.
  • Changed the name of new test case from TestJobInitializationPoller to TestJobInitialization.
  • Corrected the typo of "initialization" in TestCapacityScheduler.
  • Removed unnecessary whitespaces.
  • Removed unused variable from TestJobInitialization
  • Removed unused import from TestCapacityScheduler

Vinod K V added a comment - 05/May/09 01:24 PM
+1 for the patch.

Sreekanth Ramakrishnan added a comment - 05/May/09 01:38 PM
Output from ant test-patch:
     [exec]
     [exec]
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
     [exec]

Sreekanth Ramakrishnan added a comment - 05/May/09 01:42 PM
Attaching patch for branch 20.

Sreekanth Ramakrishnan added a comment - 05/May/09 05:09 PM
Attaching patch incorporating Hemanth's offline comments.

In TestJobInitialization

  • Changed the scheduler property from guaranteed-capacity to capacity.
  • Asserting that the submitted job has failed.

Attaching patches for both branch 20 and trunk.


Hemanth Yamijala added a comment - 05/May/09 05:37 PM
Changes look fine to me. +1.

Sreekanth Ramakrishnan added a comment - 05/May/09 06:01 PM
Output for ant test-patch for the latest attachment is as follows:
     [exec]
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
     [exec]

Hemanth Yamijala added a comment - 06/May/09 05:46 AM
I just committed this to trunk and branch 0.20. Thanks, Sreekanth!

Hudson added a comment - 07/May/09 01:19 PM
Integrated in Hadoop-trunk #828 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/828/)
. Forgot to add a new file in the previous commit.
. Remove jobs that failed initialization from the waiting queue in the capacity scheduler. Contributed by Sreekanth Ramakrishnan.