Hadoop Common / HADOOP-5636

Job is left in Running state after a killJob

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In one scenario, a job was left in the Running state forever when a kill was issued after the job setup task had been launched.

        Activity

        Amareshwari Sriramadasu added a comment -

        Job has 60 maps and 50 reduces.
        JobTracker log for the job:

        06:40:08,409 INFO org.apache.hadoop.mapred.JobHistory: Deleting job history file xyz
        06:40:11,621 INFO org.apache.hadoop.mapred.JobTracker: Restoration complete
        06:40:11,694 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_m_000061_0' to tip task_200903310541_9080_m_000061, for tracker 'xxx'
        06:40:11,737 INFO org.apache.hadoop.mapred.JobInProgress: Killingjob 'job_200903310541_9080'
        06:40:11,748 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_m_000060_0' to tip task_200903310541_9080_m_000060, for tracker 'xxxx'
        06:40:11,750 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_r_000035_0' to tip task_200903310541_9080_r_000035, for tracker 'xxxxxx'
        06:40:11,803 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_r_000047_0' to tip task_200903310541_9080_r_000047, for tracker 'xxxxxxxx'
        .
        .
        .
        all reducers are launched.
        06:40:36,568 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200903310541_9080_m_000060_0' has completed task_200903310541_9080_m_000060 successfully.
        06:40:41,980 INFO org.apache.hadoop.mapred.JobHistory: Recovered job history filename for job job_200903310541_9080 is xyz
        06:40:41,981 INFO org.apache.hadoop.mapred.JobHistory: Renaming xyz.recover to xyz
        06:40:42,001 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_m_000060_0' from 'xxx'
        06:40:42,061 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000035_0' from 'xxxx'
        06:40:42,073 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000047_0' from 'xxxxx'
        06:40:42,256 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_200903310541_9080_m_000061_0' has completed task_200903310541_9080_m_000061 successfully.
        06:40:42,263 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903310541_9080_r_000002_0' to tip task_200903310541_9080_r_000002, for tracker xxx
        06:41:26,579 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000002_0' from xxxx
        06:40:42,271 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000043_0' from xxx
        06:40:42,338 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903310541_9080_r_000010_1' from xxxx
        .
        .
        .

        Amar Kamat added a comment -

        Attaching a patch that fixes this issue by moving a job to the running state upon setup success only if the job is still in the prep state. Result of test-patch:

        [exec] -1 overall.
        [exec]
        [exec]     +1 @author.  The patch does not contain any @author tags.
        [exec]
        [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
        [exec]                         Please justify why no tests are needed for this patch.
        [exec]
        [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
        [exec]
        [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]
        [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

        Running ant test now.

        Amar Kamat added a comment -

        Ant tests passed on my box.

        Devaraj Das added a comment -

        I just committed this. Thanks, Amar!

        Nigel Daley added a comment -

        Devaraj,

        [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
        [exec] Please justify why no tests are needed for this patch.

        Why did you commit this without a test or justification?

        Hudson added a comment -

        Integrated in Hadoop-trunk #830 (see http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/830/).
        Prevents a job from going to the RUNNING state after it has been KILLED (this used to happen when the SetupTask came back with a success after the job had been killed). Contributed by Amar Kamat.

        Amar Kamat added a comment -

        Nigel,
        It's not easy to write a test case for this. The situation is something like this:

        1. The JobTracker schedules a setup task for a job.
        2. The user issues a job kill and the job is marked for cleanup.
        3. Cleanup returns and the job is marked killed.
        4. Setup returns at the same time and moves the job to the running state.

        The only hard part is making the tracker that runs the setup task return at that same moment.
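
        The interleaving in steps 1 to 4 can be sketched with a toy state machine. This is only an illustration, not the actual JobInProgress code: the class JobStateSketch, the State enum and all method names are hypothetical. It shows the guard the patch describes, where a setup success moves the job to running only if the job is still in prep.

```java
public class JobStateSketch {
    // Hypothetical job states; the names mirror the ones used in this thread.
    enum State { PREP, RUNNING, KILLED }

    private State state = State.PREP;

    // Step 3: the cleanup triggered by the kill returns and the job is marked killed.
    synchronized void cleanupAfterKillCompleted() {
        state = State.KILLED;
    }

    // Step 4 without a guard: a setup success always moves the job to RUNNING.
    synchronized void setupSucceededUnguarded() {
        state = State.RUNNING;
    }

    // Step 4 with the guard: only a job that is still in PREP may move to RUNNING.
    synchronized void setupSucceededGuarded() {
        if (state == State.PREP) {
            state = State.RUNNING;
        }
    }

    synchronized State getState() {
        return state;
    }

    public static void main(String[] args) {
        JobStateSketch unguarded = new JobStateSketch();
        unguarded.cleanupAfterKillCompleted();
        unguarded.setupSucceededUnguarded();
        System.out.println("without guard: " + unguarded.getState()); // RUNNING

        JobStateSketch guarded = new JobStateSketch();
        guarded.cleanupAfterKillCompleted();
        guarded.setupSucceededGuarded();
        System.out.println("with guard: " + guarded.getState()); // KILLED
    }
}
```

        In the unguarded variant the late setup report overwrites the killed state, which matches the symptom in the description; with the guard the late report is ignored.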

        Nigel Daley added a comment -

        Amar, so did you manually test this fix, or was it not tested at all? If it was manually tested, can you describe the manual test?

        Amar Kamat added a comment -

        @Nigel : Karam tested this patch.
        @Karam : can you please describe how you tested this patch?

        Karam Singh added a comment -

        Submitted a job whose setup task runs for 3 minutes.
        While the job's setup task is running, go to the TT on which the setup task is running and suspend the TT process.
        Issue hadoop job -kill.
        Check that the job has moved to the killed state.
        Resume the TT (the TT process should be resumed at the time the setup task is complete).
        Without the 5636 patch applied:
        The job is switched to the running state. The job is not removed from the capacity scheduler queue.
        We can see a NullPointerException in the JobTracker log on assignTask. No new job is scheduled.

        With the patch applied:
        The job state does not change and the job is removed from the capacity scheduler queue.
        There is no NPE in the JobTracker log, and other jobs are getting scheduled.
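
        The manual procedure forces the ordering by suspending the TT until after the kill has gone through. In an automated test the same ordering could in principle be forced with a latch. The sketch below is only an assumption about how that might look: it uses no Hadoop classes, and DelayedSetupSketch, State and run are hypothetical names. The latch plays the role of resuming the suspended TT.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class DelayedSetupSketch {
    // Hypothetical job states; the names mirror the ones used in this thread.
    enum State { PREP, RUNNING, KILLED }

    // Forces the order "kill completes, then setup reports success" and returns the final state.
    static State run(boolean guarded) throws InterruptedException {
        AtomicReference<State> state = new AtomicReference<>(State.PREP);
        CountDownLatch killDone = new CountDownLatch(1); // released once the job is killed

        Thread setupReport = new Thread(() -> {
            try {
                killDone.await(); // hold back the setup success, like the suspended TT
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            if (guarded) {
                // With the guard: move to RUNNING only if the job is still in PREP.
                state.compareAndSet(State.PREP, State.RUNNING);
            } else {
                // Without the guard: the late setup success overwrites KILLED.
                state.set(State.RUNNING);
            }
        });
        setupReport.start();

        state.set(State.KILLED); // the kill goes through while setup is held back
        killDone.countDown();    // "resume the TT"
        setupReport.join();
        return state.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without guard: " + run(false)); // RUNNING
        System.out.println("with guard: " + run(true));     // KILLED
    }
}
```

        Because the latch fixes the ordering, the outcome does not depend on timing, which is the part that is hard to arrange against a real tracker.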


          People

          • Assignee:
            Amar Kamat
            Reporter:
            Amareshwari Sriramadasu
          • Votes:
            0
            Watchers:
            3
