
Key: HADOOP-5636
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: Amar Kamat
Reporter: Amareshwari Sriramadasu
Votes: 0
Watchers: 3
Hadoop Common

Job is left in Running state after a killJob

Created: 07/Apr/09 11:22 AM   Updated: 08/Jul/09 04:53 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.20.1

Time Tracking:
Not Specified

File Attachments:
HADOOP-5636-v1.0.patch (text file, licensed for inclusion in ASF works), 2009-05-08 04:26 AM, Amar Kamat, 0.9 kB

Hadoop Flags: Reviewed
Resolution Date: 08/May/09 10:22 AM


Description
In one scenario, a job was left in the Running state forever when a kill was issued after the job setup task had been launched.

Amareshwari Sriramadasu added a comment - 07/Apr/09 11:36 AM
Job has 60 maps and 50 reduces.
JobTracker log for the job :

06:40:08,409 INFO org.apache.hadoop.mapred.JobHistory: Deleting job history file xyz
06:40:11,621 INFO org.apache.hadoop.mapred.JobTracker: Restoration complete
06:40:11,694 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_m_000061_0' to tip task_200903310541_9080_m_000061, for tracker 'xxx'
06:40:11,737 INFO org.apache.hadoop.mapred.JobInProgress: Killingjob 'job_200903310541_9080'
06:40:11,748 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_m_000060_0' to tip task_200903310541_9080_m_000060, for tracker 'xxxx'
06:40:11,750 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_r_000035_0' to tip task_200903310541_9080_r_000035, for tracker 'xxxxxx'
06:40:11,803 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_r_000047_0' to tip task_200903310541_9080_r_000047, for tracker 'xxxxxxxx'
.
.
.
all reducers are launched.
06:40:36,568 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200903310541_9080_m_000060_0' has completed task_200903310541_9080_m_000060 successfully.
06:40:41,980 INFO org.apache.hadoop.mapred.JobHistory: Recovered job history filename for job job_200903310541_9080 is xyz
06:40:41,981 INFO org.apache.hadoop.mapred.JobHistory: Renaming xyz.recover to xyz
06:40:42,001 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_m_000060_0' from 'xxx'
06:40:42,061 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000035_0' from 'xxxx'
06:40:42,073 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000047_0' from 'xxxxx'
06:40:42,256 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200903310541_9080_m_000061_0' has completed task_200903310541_9080_m_000061 successfully.
06:40:42,263 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_r_000002_0' to tip task_200903310541_9080_r_000002, for tracker xxx
06:41:26,579 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000002_0' from xxxx
06:40:42,271 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000043_0' from xxx
06:40:42,338 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000010_1' from xxxx
.
.
.


Amar Kamat added a comment - 08/May/09 04:26 AM
Attaching a patch that fixes this issue by moving a job to the running state upon setup success only if the job is in the prep state. Result of test-patch:
     [exec] -1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
     [exec]                         Please justify why no tests are needed for this patch.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Running ant test now.


Amar Kamat added a comment - 08/May/09 08:38 AM
Ant tests passed on my box.

Devaraj Das added a comment - 08/May/09 10:22 AM
I just committed this. Thanks, Amar!

Nigel Daley added a comment - 08/May/09 04:18 PM
Devaraj,

[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.

Why did you commit this without a test or justification?


Hudson added a comment - 08/May/09 07:55 PM
Integrated in Hadoop-trunk #830 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/830/)
HADOOP-5636. Prevents a job from going to RUNNING state after it has been KILLED (this used to happen when the SetupTask would come back with a success after the job had been killed). Contributed by Amar Kamat.

Amar Kamat added a comment - 09/May/09 01:09 PM
Nigel,
It's not easy to write a test case for this. The situation is something like this:
  1. The JobTracker schedules a setup task for a job.
  2. The user issues a job kill and the job is marked for cleanup.
  3. The cleanup returns and the job is marked killed.
  4. The setup returns at the same time and moves the job to the running state.

The only hard part is making the tracker that runs the setup task return at the same time.
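
The four steps above can be replayed deterministically against a toy state machine. This is only a minimal sketch: the class, enum, and method names (SetupRaceSketch, Job, setupSucceededGuarded, and so on) are hypothetical and are not taken from JobInProgress or from the attached patch. It assumes the fix amounts to allowing the setup-success transition only from the prep state, as described in the patch comment.

```java
// Hypothetical, simplified model of the job states discussed in this issue.
// It is NOT the actual org.apache.hadoop.mapred.JobInProgress code.
class SetupRaceSketch {
    enum State { PREP, RUNNING, KILLED }

    static class Job {
        private State state = State.PREP;

        synchronized State state() { return state; }

        // Steps 2-3: a kill was issued and the cleanup task has returned.
        synchronized void cleanupCompletedAfterKill() { state = State.KILLED; }

        // Behaviour described for the unpatched code: a late setup success
        // moves the job to RUNNING regardless of its current state.
        synchronized void setupSucceededUnguarded() { state = State.RUNNING; }

        // Behaviour described for the patch: only PREP -> RUNNING is allowed.
        synchronized void setupSucceededGuarded() {
            if (state == State.PREP) {
                state = State.RUNNING;
            }
        }
    }

    // Replays steps 1-4 in order and returns the final job state.
    static State replay(boolean guarded) {
        Job job = new Job();              // step 1: setup scheduled, job in PREP
        job.cleanupCompletedAfterKill();  // steps 2-3: job marked KILLED
        if (guarded) {                    // step 4: setup success arrives late
            job.setupSucceededGuarded();
        } else {
            job.setupSucceededUnguarded();
        }
        return job.state();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + replay(false)); // RUNNING
        System.out.println("guarded:   " + replay(true));  // KILLED
    }
}
```

The sketch sidesteps the timing problem by calling the two handlers in a fixed order; in a real cluster the difficulty remains getting the tracker to report the setup success after the kill has been processed.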


Nigel Daley added a comment - 15/May/09 04:33 PM
Amar, so did you manually test this fix or not? If it was manually tested, can you describe the manual test?

Amar Kamat added a comment - 16/May/09 06:56 AM
@Nigel: Karam tested this patch.
@Karam: can you please describe how you tested this patch?

Karam Singh added a comment - 18/May/09 08:33 AM
Submitted a job whose setup task runs for 3 minutes.
While the job's setup task was running, went to the TT on which it was running and suspended the TT process.
Issued hadoop job -kill.
Checked that the job moved to the killed state.
Resumed the TT (the TT process should be resumed at the time the setup task completes).
Without the 5636 patch applied:
The job switches to the running state. The job is not removed from the capacity scheduler queue.
We can see a NullPointerException in the JobTracker log on assignTask. No new job is scheduled.

With the patch applied:
The job state does not change and the job is removed from the capacity scheduler queue.
No NPE in the JobTracker log, and other jobs are getting scheduled.