Details

    • Type: Sub-task Sub-task
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.204.0
    • Fix Version/s: None
    • Component/s: tasktracker
    • Labels:
      None
    • Target Version/s:

      Description

      TaskInProgress#kill in the JT fails because TaskStatus#setFinishTimes fails because no start time was set. There's no start time because TaskTracker#run (DefaultTaskController#initializeJob) failed before it was set. The fix is to have TT#launchTask set the start time before it starts the task runner, this way there's a valid start time even if TT#run fails.

        Activity

        Matt Foley made changes -
        Target Version/s 1.2.0 [ 12321661 ] 1.3.0 [ 12324153 ]
        Hide
        Matt Foley added a comment -

        Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.

        Show
        Matt Foley added a comment - Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.
        Matt Foley made changes -
        Target Version/s 1.1.0 [ 12317960 ] 1.2.0 [ 12321661 ]
        Hide
        Matt Foley added a comment -

        Moved to 1.2.0 upon release of 1.1.0.

        Show
        Matt Foley added a comment - Moved to 1.2.0 upon release of 1.1.0.
        Hide
        Eli Collins added a comment -

        You're right - the last comment is bogus (the JT was on a RO fs).

        The earlier ones however are from just TTs running on loop-back mounts with faults injected, and the JT was fine. On the 1st it looks like the issue is that the JobClient doesn't handle errors getting task output, or when TT exceptions get plumbed back up to it. Though perhaps per MAPREDUCE-3473 this is expected behavior given that *.failures.maxpercent defaults to 0.

        Show
        Eli Collins added a comment - You're right - the last comment is bogus (the JT was on a RO fs). The earlier ones however are from just TTs running on loop-back mounts with faults injected, and the JT was fine. On the 1st it looks like the issue is that the JobClient doesn't handle errors getting task output, or when TT exceptions get plumbed back up to it. Though perhaps per MAPREDUCE-3473 this is expected behavior given that *.failures.maxpercent defaults to 0.
        Hide
        Todd Lipcon added a comment -

        This looks like it failed because of a failed disk on the JT rather than the TT:
        ...
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1203)
        at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:415)
        ...

        are you running the JT on a machine that has bad disks simulated?

        Show
        Todd Lipcon added a comment - This looks like it failed because of a failed disk on the JT rather than the TT: ... at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1203) at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:415) ... are you running the JT on a machine that has bad disks simulated?
        Hide
        Eli Collins added a comment -

        Ditto for Job submission with a read-only file system. Subsequent submissions succeed.

        Starting Job
        11/11/11 14:27:26 INFO mapred.FileInputFormat: Total input paths to process : 2
        11/11/11 14:27:26 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/user/eli/mapred/staging/eli/.staging/job_201111111426_0001
        org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.FileNotFoundException: /mnt/read-only/jobTracker/job_201111111426_0001.xml (Read-only file system)
        	at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
        Caused by: java.io.FileNotFoundException: /mnt/read-only/jobTracker/job_201111111426_0001.xml (Read-only file system)
        	at java.io.FileOutputStream.open(Native Method)
        	at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
        	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:188)
        	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:184)
        	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:255)
        	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:236)
        	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:335)
        	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:381)
        	at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364)
        	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
        	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536)
        	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443)
        	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:229)
        	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163)
        	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1222)
        	at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1203)
        	at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:415)
        	at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
        	... 11 more
        
        	at org.apache.hadoop.ipc.Client.call(Client.java:1066)
        	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
        	at org.apache.hadoop.mapred.$Proxy2.submitJob(Unknown Source)
        	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:913)
        	at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at javax.security.auth.Subject.doAs(Subject.java:396)
        	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
        	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:816)
        	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1253)
        	at org.apache.hadoop.examples.PiEstimator.estimate(PiEstimator.java:297)
        	at org.apache.hadoop.examples.PiEstimator.run(PiEstimator.java:342)
        	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        	at org.apache.hadoop.examples.PiEstimator.main(PiEstimator.java:351)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.jav	at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        	at java.lang.reflect.Method.invoke(Method.java:597)
        	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
        
        Show
        Eli Collins added a comment - Ditto for Job submission with a read-only file system. Subsequent submissions succeed. Starting Job 11/11/11 14:27:26 INFO mapred.FileInputFormat: Total input paths to process : 2 11/11/11 14:27:26 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/user/eli/mapred/staging/eli/.staging/job_201111111426_0001 org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.FileNotFoundException: /mnt/read-only/jobTracker/job_201111111426_0001.xml (Read-only file system) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387) Caused by: java.io.FileNotFoundException: /mnt/read-only/jobTracker/job_201111111426_0001.xml (Read-only file system) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:179) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:188) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:184) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:255) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:236) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:335) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:381) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:229) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:163) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1222) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1203) at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:415) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more at org.apache.hadoop.ipc.Client.call(Client.java:1066) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at org.apache.hadoop.mapred.$Proxy2.submitJob(Unknown Source) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:913) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:816) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1253) at org.apache.hadoop.examples.PiEstimator.estimate(PiEstimator.java:297) at org.apache.hadoop.examples.PiEstimator.run(PiEstimator.java:342) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.PiEstimator.main(PiEstimator.java:351) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.jav at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
        Hide
        Eli Collins added a comment -

        This affects the LTC as well. The log claims the job completed, however it was a wordcount job and it never populated the output directory.

        11/11/03 12:51:56 WARN mapred.JobClient: Error reading task outputhttp://eli-thi
        nkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_
        1&filter=stdout
        11/11/03 12:51:56 WARN mapred.JobClient: Error reading task outputhttp://eli-thi
        nkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_
        1&filter=stderr
        11/11/03 12:51:59 INFO mapred.JobClient: Task Id : attempt_201111031152_0006_m_0
        00001_2, Status : FAILED
        Error initializing attempt_201111031152_0006_m_000001_2:
        java.io.IOException: Job initialization failed (255)
                at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskC
        ontroller.java:192)
                at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1231)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:396)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma
        tion.java:1059)
                at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1
        206)
                at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:112
        1)
                at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2410)
                at java.lang.Thread.run(Thread.java:662)
        Caused by: org.apache.hadoop.util.Shell$ExitCodeException: 
                at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
                at org.apache.hadoop.util.Shell.run(Shell.java:182)
                at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
        375)
                at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskC
        ontroller.java:185)
                ... 8 more
        
        11/11/03 12:51:59 WARN mapred.JobClient: Error reading task outputhttp://eli-thinkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_2&filter=stdout
        11/11/03 12:51:59 WARN mapred.JobClient: Error reading task outputhttp://eli-thinkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_2&filter=stderr
        11/11/03 12:52:02 INFO mapred.JobClient: Job complete: job_201111031152_0006
        11/11/03 12:52:02 INFO mapred.JobClient: Counters: 4
        11/11/03 12:52:02 INFO mapred.JobClient:   Job Counters 
        11/11/03 12:52:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=0
        11/11/03 12:52:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
        11/11/03 12:52:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
        11/11/03 12:52:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
        
        Show
        Eli Collins added a comment - This affects the LTC as well. The log claims the job completed, however it was a wordcount job and it never populated the output directory. 11/11/03 12:51:56 WARN mapred.JobClient: Error reading task outputhttp://eli-thi nkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_ 1&filter=stdout 11/11/03 12:51:56 WARN mapred.JobClient: Error reading task outputhttp://eli-thi nkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_ 1&filter=stderr 11/11/03 12:51:59 INFO mapred.JobClient: Task Id : attempt_201111031152_0006_m_0 00001_2, Status : FAILED Error initializing attempt_201111031152_0006_m_000001_2: java.io.IOException: Job initialization failed (255) at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskC ontroller.java:192) at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1231) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1059) at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1 206) at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:112 1) at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2410) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.hadoop.util.Shell$ExitCodeException: at org.apache.hadoop.util.Shell.runCommand(Shell.java:255) at org.apache.hadoop.util.Shell.run(Shell.java:182) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java: 375) at org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskC ontroller.java:185) ... 8 more 11/11/03 12:51:59 WARN mapred.JobClient: Error reading task outputhttp://eli-thinkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_2&filter=stdout 11/11/03 12:51:59 WARN mapred.JobClient: Error reading task outputhttp://eli-thinkpad:50316/tasklog?plaintext=true&attemptid=attempt_201111031152_0006_m_000001_2&filter=stderr 11/11/03 12:52:02 INFO mapred.JobClient: Job complete: job_201111031152_0006 11/11/03 12:52:02 INFO mapred.JobClient: Counters: 4 11/11/03 12:52:02 INFO mapred.JobClient: Job Counters 11/11/03 12:52:02 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=0 11/11/03 12:52:02 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 11/11/03 12:52:02 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 11/11/03 12:52:02 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
        Matt Foley made changes -
        Field Original Value New Value
        Fix Version/s 0.20.205.0 [ 12316391 ]
        Target Version/s 0.20.206.0 [ 12317960 ]
        Hide
        Eli Collins added a comment -

        Just fixing the setTimes issue doesn't fix the total issue. TaskRunner#run keeps failing to create the log dirs because it hasn't yet removed the failing dir from good dirs. I'm not sure why the job submission hangs, it should try to run the task on one of the other TTs (there are two others running in my case).

        java.lang.Throwable: Child Error
        	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
        Caused by: java.io.IOException: Creation of /faulty-disk/dir1/userlogs/job_201109081514_0002/attempt_201109081514_0002_m_000000_1 failed.
        	at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:102)
        	at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71)
        	at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316)
        	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228)
        
        -------
        java.lang.Throwable: Child Error
        	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
        Caused by: java.io.IOException: Task process exit with nonzero status of -1.
        	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
        
        Show
        Eli Collins added a comment - Just fixing the setTimes issue doesn't fix the total issue. TaskRunner#run keeps failing to create the log dirs because it hasn't yet removed the failing dir from good dirs. I'm not sure why the job submission hangs, it should try to run the task on one of the other TTs (there are two others running in my case). java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) Caused by: java.io.IOException: Creation of /faulty-disk/dir1/userlogs/job_201109081514_0002/attempt_201109081514_0002_m_000000_1 failed. at org.apache.hadoop.mapred.TaskLog.createTaskAttemptLogDir(TaskLog.java:102) at org.apache.hadoop.mapred.DefaultTaskController.createLogDir(DefaultTaskController.java:71) at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:316) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:228) ------- java.lang.Throwable: Child Error at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271) Caused by: java.io.IOException: Task process exit with nonzero status of -1. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
        Eli Collins created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Eli Collins
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:

              Development