Hadoop Map/Reduce / MAPREDUCE-4770

Hadoop jobs failing with FileNotFoundException while the job is still running

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.20.203.0
    • Fix Version/s: None
    • Component/s: tasktracker
    • Labels: None

      Description

      We are seeing a strange issue in our Hadoop cluster: some of our jobs fail with a FileNotFoundException (see below). The files in the "attempt_*" directory, and the directory itself, are being deleted while the task is still running on the host. From the Hadoop documentation I understand that the job directory is wiped out when a KillJobAction is received, but I am not sure why it would be wiped out while the job is still running.

      My question is what could be deleting it while the job is running? Any thoughts or pointers on how to debug this would be helpful.

      Thanks!

      java.io.FileNotFoundException: /hadoop/mapred/local_data/taskTracker//jobcache/job_201211030344_15383/attempt_201211030344_15383_m_000169_0/output/spill29.out (Permission denied)
          at java.io.FileInputStream.open(Native Method)
          at java.io.FileInputStream.<init>(FileInputStream.java:120)
          at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.<init>(RawLocalFileSystem.java:71)
          at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.<init>(RawLocalFileSystem.java:107)
          at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:177)
          at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
          at org.apache.hadoop.mapred.Merger$Segment.init(Merger.java:205)
          at org.apache.hadoop.mapred.Merger$Segment.access$100(Merger.java:165)
          at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:418)
          at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
          at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
          at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1692)
          at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
          at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
          at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
          at org.apache.hadoop.mapred.Child.main(Child.java:253)
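
One detail in the trace may be worth checking while debugging: the message ends in "(Permission denied)". java.io.FileInputStream reports that text when the operating system refuses the open, which is a different condition from the path being absent ("No such file or directory"). The sketch below is a standalone diagnostic, not Hadoop code; the class name SpillFileCheck and its classify helper are hypothetical. Run as the same user as the task, against the spill file or its parent directories, it may help tell a deleted file apart from a permissions or ownership problem.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical standalone helper; not part of Hadoop.
class SpillFileCheck {

    /** Reports how a local path looks to the current user. */
    static String classify(Path file) {
        if (Files.notExists(file)) {
            // The file, or a directory above it, is confirmed to be gone.
            return "missing";
        }
        if (!Files.exists(file)) {
            // Neither confirmed present nor confirmed absent, for example
            // when a parent directory cannot be searched by this user.
            return "undetermined";
        }
        if (!Files.isReadable(file)) {
            // Present, but this user may not open it for reading.
            return "exists-but-unreadable";
        }
        return "readable";
    }

    public static void main(String[] args) {
        String user = System.getProperty("user.name");
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(user + " -> " + p + ": " + classify(p));
        }
    }
}
```

Passing the attempt directory, its output subdirectory and the spill file as separate arguments shows at which level access stops, assuming the paths still exist when the check is run.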

        Activity

        Arun A K added a comment -

        Not sure if this could be the solution -

        IsolationRunner is a utility to help debug MapReduce programs.

        To use the IsolationRunner, first set keep.failed.task.files to true (also see keep.task.files.pattern).
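
For illustration, a minimal sketch of how that property could be set, assuming it goes into a Hadoop configuration file such as mapred-site.xml or into the job's own configuration. Only the property name comes from the text above; the choice of file is an assumption.

```xml
<!-- Keep the local files of failed task attempts for later inspection -->
<property>
  <name>keep.failed.task.files</name>
  <value>true</value>
</property>
```

Jobs that parse generic options (for example through ToolRunner) can usually take the same setting per job as -D keep.failed.task.files=true, which avoids changing it cluster-wide.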

        Next, go to the node on which the failed task ran, change to the TaskTracker's local directory, and run the IsolationRunner:
        $ cd <local path>/taskTracker/${taskid}/work
        $ bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml

        IsolationRunner will run the failed task in a single JVM, which can be run under a debugger, over precisely the same input.

        Note that currently IsolationRunner will only re-run map tasks.

        Reference : http://hadoop.apache.org/docs/r1.1.1/mapred_tutorial.html


          People

          • Assignee: Unassigned
          • Reporter: Jaikannan Ramamoorthy
          • Votes: 0
          • Watchers: 5
