Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-5169

Job recovery fails if job tracker is restarted after the job is submitted but before its initialized

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.2.0
    • Component/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      This was noticed when within 5 seconds of submitting a word count job, the job tracker was restarted. Upon restart the job failed to recover

      1. MAPREDUCE-5169.patch
        13 kB
        Arun C Murthy
      2. MAPREDUCE-5169.patch
        7 kB
        Arun C Murthy

        Activity

        Hide
        Matt Foley added a comment -

        Closed upon release of Hadoop 1.2.0.

        Show
        Matt Foley added a comment - Closed upon release of Hadoop 1.2.0.
        Hide
        Matt Foley added a comment -

        added 1.2.0 to fixVersion per CHANGES.txt and above comments.

        Show
        Matt Foley added a comment - added 1.2.0 to fixVersion per CHANGES.txt and above comments.
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks for checking on this Arpit Gupta!

        Show
        Arun C Murthy added a comment - I just committed this. Thanks for checking on this Arpit Gupta !
        Hide
        Arpit Gupta added a comment -

        +1

        Show
        Arpit Gupta added a comment - +1
        Hide
        Arpit Gupta added a comment -

        Took the latest patch and ran through the tests and they worked. One of the tests that would fail consistently is if there were pending jobs in the system, after restart they were not recovered. With this patch those jobs recovered successfully.

        Show
        Arpit Gupta added a comment - Took the latest patch and ran through the tests and they worked. One of the tests that would fail consistently is if there were pending jobs in the system, after restart they were not recovered. With this patch those jobs recovered successfully.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12580214/MAPREDUCE-5169.patch
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3544//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12580214/MAPREDUCE-5169.patch against trunk revision . -1 patch . The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3544//console This message is automatically generated.
        Hide
        Arun C Murthy added a comment -

        Updated patch with a unit test.

        Show
        Arun C Murthy added a comment - Updated patch with a unit test.
        Hide
        Arun C Murthy added a comment -

        Arpit Gupta thanks for helping narrow this down!

        Show
        Arun C Murthy added a comment - Arpit Gupta thanks for helping narrow this down!
        Hide
        Arun C Murthy added a comment -

        Straight-fwd fix: generate/store jobToken at job-submit along-with job-info file.

        Fixed one other bug alongside: job-info file shouldn't be written to HDFS under JT lock.

        Show
        Arun C Murthy added a comment - Straight-fwd fix: generate/store jobToken at job-submit along-with job-info file. Fixed one other bug alongside: job-info file shouldn't be written to HDFS under JT lock.
        Hide
        Arpit Gupta added a comment - - edited

        Client Logs

        /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop/hadoop-examples-1.X.Y.jar wordcount "-Dmapreduce.reduce.input.limit=-1" /user/hrt_qa/test_mapred_ha/medium_wordcount_input /user/hrt_qa/test_mapred_ha/jobtracker-near-submit_wc_output
        13/04/17 22:04:10 INFO input.FileInputFormat: Total input paths to process : 20
        13/04/17 22:04:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
        13/04/17 22:04:10 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3]
        13/04/17 22:04:10 WARN snappy.LoadSnappy: Snappy native library is available
        13/04/17 22:04:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
        13/04/17 22:04:10 INFO snappy.LoadSnappy: Snappy native library loaded
        13/04/17 22:04:11 INFO mapred.JobClient: Running job: job_201304172049_0008
        13/04/17 22:04:12 INFO mapred.JobClient:  map 0% reduce 0%
        13/04/17 22:04:32 INFO ipc.Client: Retrying connect to server: host:50300. Already tried 0 time(s); retry policy is MultipleLinearRandomRetry[6x10000ms, 10x60000ms]
        java.io.IOException: The job appears to have been removed.
        at org.apache.hadoop.mapred.JobClient$NetworkedJob.updateStatus(JobClient.java:241)
        at org.apache.hadoop.mapred.JobClient$NetworkedJob.isComplete(JobClient.java:321)
        at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1382)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:583)
        at org.apache.hadoop.examples.WordCount.main(WordCount.java:82)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
        

        Jobtracker logs

        2013-04-17 22:04:11,305 INFO org.apache.hadoop.mapred.JobInProgress: job_201304172049_0008: nMaps=180 nReduces=1 max=-1
        2013-04-17 22:04:11,357 INFO org.apache.hadoop.mapred.JobQueuesManager: Job job_201304172049_0008 submitted to queue default
        2013-04-17 22:04:11,357 INFO org.apache.hadoop.mapred.JobTracker: Job job_201304172049_0008 added successfully for user 'hrt_qa' to queue 'default'
        2013-04-17 22:04:14,268 INFO org.apache.hadoop.mapred.JobInitializationPoller: Passing to Initializer Job Id :job_201304172049_0008 User: user Queue : default
        2013-04-17 22:04:15,181 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG:
        /************************************************************
        SHUTDOWN_MSG: Shutting down JobTracker at 
        ************************************************************/
        2013-04-17 22:04:15,204 INFO org.apache.hadoop.mapred.JobInitializationPoller: Initializing job : job_201304172049_0008 in Queue default For user : user
        2013-04-17 22:04:15,204 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201304172049_0008
        2013-04-17 22:04:15,204 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201304172049_0008
        2013-04-17 22:04:30,900 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
        /************************************************************
        STARTUP_MSG: Starting JobTracker
        STARTUP_MSG:   host = 
        STARTUP_MSG:   args = []
        STARTUP_MSG:   version = 
        STARTUP_MSG:   build = 
        STARTUP_MSG:   java = 1.6.0_31
        ************************************************************/
        ...
        ...
        ...
        2013-04-17 22:04:31,862 WARN org.apache.hadoop.mapred.JobTracker: Job job_201304172049_0008 does not have valid info/token file so ignoring for recovery
        
        Show
        Arpit Gupta added a comment - - edited Client Logs /usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop/hadoop-examples-1.X.Y.jar wordcount "-Dmapreduce.reduce.input.limit=-1" /user/hrt_qa/test_mapred_ha/medium_wordcount_input /user/hrt_qa/test_mapred_ha/jobtracker-near-submit_wc_output 13/04/17 22:04:10 INFO input.FileInputFormat: Total input paths to process : 20 13/04/17 22:04:10 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library 13/04/17 22:04:10 INFO lzo.LzoCodec: Successfully loaded & initialized native -lzo library [hadoop-lzo rev cf4e7cbf8ed0f0622504d008101c2729dc0c9ff3] 13/04/17 22:04:10 WARN snappy.LoadSnappy: Snappy native library is available 13/04/17 22:04:10 INFO util.NativeCodeLoader: Loaded the native -hadoop library 13/04/17 22:04:10 INFO snappy.LoadSnappy: Snappy native library loaded 13/04/17 22:04:11 INFO mapred.JobClient: Running job: job_201304172049_0008 13/04/17 22:04:12 INFO mapred.JobClient: map 0% reduce 0% 13/04/17 22:04:32 INFO ipc.Client: Retrying connect to server: host:50300. Already tried 0 time(s); retry policy is MultipleLinearRandomRetry[6x10000ms, 10x60000ms] java.io.IOException: The job appears to have been removed. at org.apache.hadoop.mapred.JobClient$NetworkedJob.updateStatus(JobClient.java:241) at org.apache.hadoop.mapred.JobClient$NetworkedJob.isComplete(JobClient.java:321) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1382) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:583) at org.apache.hadoop.examples.WordCount.main(WordCount.java:82) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:160) Jobtracker logs 2013-04-17 22:04:11,305 INFO org.apache.hadoop.mapred.JobInProgress: job_201304172049_0008: nMaps=180 nReduces=1 max=-1 2013-04-17 22:04:11,357 INFO org.apache.hadoop.mapred.JobQueuesManager: Job job_201304172049_0008 submitted to queue default 2013-04-17 22:04:11,357 INFO org.apache.hadoop.mapred.JobTracker: Job job_201304172049_0008 added successfully for user 'hrt_qa' to queue ' default ' 2013-04-17 22:04:14,268 INFO org.apache.hadoop.mapred.JobInitializationPoller: Passing to Initializer Job Id :job_201304172049_0008 User: user Queue : default 2013-04-17 22:04:15,181 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down JobTracker at ************************************************************/ 2013-04-17 22:04:15,204 INFO org.apache.hadoop.mapred.JobInitializationPoller: Initializing job : job_201304172049_0008 in Queue default For user : user 2013-04-17 22:04:15,204 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201304172049_0008 2013-04-17 22:04:15,204 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201304172049_0008 2013-04-17 22:04:30,900 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting JobTracker STARTUP_MSG: host = STARTUP_MSG: args = [] STARTUP_MSG: version = STARTUP_MSG: build = STARTUP_MSG: java = 1.6.0_31 ************************************************************/ ... ... ... 2013-04-17 22:04:31,862 WARN org.apache.hadoop.mapred.JobTracker: Job job_201304172049_0008 does not have valid info/token file so ignoring for recovery

          People

          • Assignee:
            Arun C Murthy
            Reporter:
            Arpit Gupta
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development