Key: HADOOP-5924
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Major
Assignee: Amar Kamat
Reporter: Ramya R
Votes: 0
Watchers: 1
Hadoop Common

JT fails to recover the jobs after restart after HADOOP-4372

Created: 27/May/09 11:19 AM   Updated: 10/Jun/09 06:05 AM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.20.1

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
H-5924.20.patch (0.7 kB), attached 2009-06-08 09:23 PM by Robert Chansler
HADOOP-5923-v2.4.patch (9 kB), attached 2009-05-30 02:19 AM by Amar Kamat
HADOOP-5924-v1.0-branch20.patch (9 kB), attached 2009-06-02 08:10 AM by Amar Kamat
HADOOP-5924-v1.0.patch (9 kB), attached 2009-06-01 08:31 AM by Amar Kamat
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Release Note:
Post HADOOP-4372, empty job history files caused an NPE. This issue fixes that by creating new files if no old file is found.
Resolution Date: 03/Jun/09 03:53 AM


Ramya R added a comment - 27/May/09 11:29 AM
Submitted a job and restarted the JT after some time. Below is an excerpt from the JT log:
INFO org.apache.hadoop.mapred.JobTracker: Submitting job <jobID> on behalf of user <user> in groups :<group>
INFO org.apache.hadoop.mapred.JobHistory: Recovered job history filename for job <jobID> is <job history file>
INFO org.apache.hadoop.mapred.JobHistory:  <job history file> exists!
INFO org.apache.hadoop.mapred.JobHistory: <job history file> exists!
INFO org.apache.hadoop.mapred.JobQueuesManager: Job submitted to queue default
WARN org.apache.hadoop.fs.FSInputChecker: Problem opening checksum file: file:<logs>history/<job history file>
Ignoring exception: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:134)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:351)
        at org.apache.hadoop.mapred.JobHistory.parseHistoryFromFS(JobHistory.java:254)
        at org.apache.hadoop.mapred.JobTracker$RecoveryManager.recover(JobTracker.java:1361)
        at org.apache.hadoop.mapred.JobTracker.offerService(JobTracker.java:1850)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3695)
INFO org.apache.hadoop.mapred.JobHistory: Deleting job history file <job history file>
INFO org.apache.hadoop.mapred.JobTracker: Restoration complete
INFO org.apache.hadoop.mapred.JobInitializationPoller: Passing to Initializer Job Id :<jobID> User:<user> Queue : default
INFO org.apache.hadoop.mapred.JobInitializationPoller: Initializing job : <jobID> in Queue default For user : <user>
INFO org.apache.hadoop.mapred.JobInProgress: Initializing <jobID>
INFO org.apache.hadoop.mapred.JobHistory: Nothing to recover for job <jobID>
INFO org.apache.hadoop.mapred.JobInitializationPoller: Job initialization failed:
java.lang.IllegalArgumentException: Can not create a Path from a null string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:78)
        at org.apache.hadoop.fs.Path.<init>(Path.java:90)
        at org.apache.hadoop.fs.Path.<init>(Path.java:45)
        at org.apache.hadoop.mapred.JobHistory$JobInfo.getJobHistoryLogLocation(JobHistory.java:577)
        at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:871)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:405)
        at org.apache.hadoop.mapred.JobInitializationPoller$JobInitializationThread.initializeJobs(JobInitializationPoller.java:143)
        at org.apache.hadoop.mapred.JobInitializationPoller$JobInitializationThread.run(JobInitializationPoller.java:113)
INFO org.apache.hadoop.mapred.JobHistory: Nothing to recover for job <jobID>
INFO org.apache.hadoop.mapred.JobInitializationPoller: Removing killed/completed job from initalized jobs list : <jobID>

The job fails to recover and is marked as failed. This happens for all jobs, irrespective of map/reduce progress.
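
Read together with the release note, the trace above suggests that recovery was left without a usable history file once the empty one had been deleted. The following is only a minimal sketch of the kind of guard the release note describes (reuse the old file if it has content, otherwise create a new one). It assumes plain java.nio.file in place of Hadoop's FileSystem, and HistoryFileGuard, resolveHistoryFile, and the "_history" file naming are hypothetical; none of it is taken from the attached patches.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only; not part of Hadoop.
final class HistoryFileGuard {

    /**
     * Returns a history file that is safe to use for the given job.
     * A recovered file is reused only if it exists and is non-empty;
     * otherwise any empty leftover is removed and a new file is created,
     * so the caller never ends up with a null location.
     */
    static Path resolveHistoryFile(Path recovered, Path historyDir, String jobId)
            throws IOException {
        if (recovered != null && Files.exists(recovered) && Files.size(recovered) > 0) {
            return recovered; // old history has content, keep it
        }
        if (recovered != null) {
            Files.deleteIfExists(recovered); // drop the empty leftover
        }
        Files.createDirectories(historyDir);
        Path fresh = historyDir.resolve(jobId + "_history"); // assumed naming
        if (Files.notExists(fresh)) {
            Files.createFile(fresh);
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("history");
        Path empty = Files.createFile(dir.resolve("old_history"));
        // The empty file is discarded and a new one is returned instead of null.
        System.out.println(resolveHistoryFile(empty, dir, "job_example"));
    }
}
```

The point of the sketch is that the caller always receives a non-null location, which is the condition the IllegalArgumentException in the trace shows was violated.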


Amar Kamat added a comment - 30/May/09 02:19 AM
Attaching a patch that fixes the issue. Result of test-patch:
[exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Running ant test now.


Amar Kamat added a comment - 30/May/09 01:05 PM
ant test passed on my box.

Devaraj Das added a comment - 31/May/09 08:39 AM
Minor nit - the check for an empty killList is redundant and can be removed.

Amar Kamat added a comment - 01/Jun/09 08:31 AM
Attaching a patch incorporating Devaraj's comments. Result of test-patch:
[exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 9 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Note that the patch depends on HADOOP-5908.

Ant tests passed on my box.


Amar Kamat added a comment - 02/Jun/09 08:10 AM
Attaching a patch for 0.20 branch.

Devaraj Das added a comment - 03/Jun/09 03:53 AM
I just committed this. Thanks, Amar!

Robert Chansler added a comment - 08/Jun/09 09:23 PM
Attached an alternate version for 0.20; it is not to be committed to the branch.