Hadoop Common / HADOOP-5194

DiskErrorException in TaskTracker when running a job

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Windows, Cygwin

    • Hadoop Flags:
      Reviewed

      Description

      This can be reproduced on Windows by running a Hadoop example such as PiEstimator.

      org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/pids/attempt_200902021632_0001_m_000002_0 in any of the configured local directories
      	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:381)
      	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
      	at org.apache.hadoop.mapred.TaskTracker.getPidFilePath(TaskTracker.java:430)
      	at org.apache.hadoop.mapred.TaskTracker.removePidFile(TaskTracker.java:440)
      	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:370)
      	at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:338)
      

      (Have changed TaskTracker.java to print out the trace.)
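The trace above ends in a lookup of a relative path across the configured local directories. As a rough illustration of that kind of lookup (not Hadoop's actual LocalDirAllocator code), the following Java sketch returns the first configured directory that contains the path and fails with a similar message otherwise. The class name LocalDirLookupSketch, the method name, and the use of FileNotFoundException are assumptions made for this example.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;

// Hypothetical, simplified stand-in for a "find this relative path in one of
// the local directories" lookup. Not the real Hadoop implementation.
final class LocalDirLookupSketch {

    /**
     * Returns the first existing file named relativePath under any of the
     * given directories, checking them in order.
     */
    static File findLocalPathToRead(List<File> localDirs, String relativePath)
            throws FileNotFoundException {
        for (File dir : localDirs) {
            File candidate = new File(dir, relativePath);
            if (candidate.exists()) {
                return candidate;
            }
        }
        // No directory holds the path, e.g. a pid file that was never written.
        throw new FileNotFoundException("Could not find " + relativePath
                + " in any of the configured local directories");
    }
}
```

Under this reading, the exception only says that the file is absent from every configured directory; it does not say why the file was never created.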

      This patch disables usage of setsid and pidfiles on Windows.

      1. HADOOP-5194.patch
        9 kB
        Ravi Gummadi
      2. HADOOP-5194.v1.patch
        9 kB
        Ravi Gummadi

        Activity

        Ravi Gummadi added a comment -

        Making this a blocker for 0.21, as it causes almost all jobs on Windows to fail.

        Tsz Wo Nicholas Sze added a comment -

        Hudson also hit this. See http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3820/console

         [exec]     [junit] 09/02/09 20:55:12 INFO mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException:
         Could not find taskTracker/jobcache/job_200902092055_0001/attempt_200902092055_0001_m_000004_0/output/file.out in any of the configured local directories
        
        Ravi Gummadi added a comment -

        The issue you see on Hudson is not related to the pid file issue that we see on Windows with Cygwin.

        I guess the issue you see on Hudson is solved by the patch for 4963. Would you please confirm?

        Vinod Kumar Vavilapalli added a comment -

        > Hudson also hit this. See http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3820/console

        This must be different and may not even refer to a bug. Will look into that separately.

        As for the original report of task and job failures on Cygwin, there seem to be multiple problems here. The exception that Nicholas reported appears on trunk before HADOOP-4759. After HADOOP-4759, that exception is gone because of the changes made in that issue, but tasks still fail with a "kill unknown jvm" problem.

        09/02/10 18:32:44 INFO mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200902101830_0001_m_000003_0 task's state:UNASSIGNED
        09/02/10 18:32:44 INFO mapred.TaskTracker: Trying to launch : attempt_200902101830_0001_m_000003_0
        09/02/10 18:32:44 INFO mapred.TaskTracker: In TaskLauncher, current free slots : 2 and trying to launch attempt_200902101830_0001_m_000003_0
        09/02/10 18:32:45 INFO mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200902101830_0001_m_1163115666
        09/02/10 18:32:45 INFO mapred.JvmManager: JVM Runner jvm_200902101830_0001_m_1163115666 spawned.
        09/02/10 18:32:45 INFO mapred.JvmManager: JVM : jvm_200902101830_0001_m_1163115666 exited. Number of tasks it ran: 0
        09/02/10 18:32:46 INFO mapred.TaskTracker: Killing unknown JVM jvm_200902101830_0001_m_1163115666
        09/02/10 18:32:48 INFO mapred.TaskRunner: attempt_200902101830_0001_m_000003_0 done; removing files.
        
        Ravi Gummadi added a comment -

        When setsid is used to create a process (HADOOP-2721 does this) on Windows with Cygwin, process.waitFor() does not seem to work: it does not wait for the process to complete if the process is not using its input stream and error stream. Can someone think of a solution or workaround for this?

        Ravi Gummadi added a comment -

        I am planning to disable usage of setsid and pidFiles for WINDOWS to resolve this issue.
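The interim fix proposed here, skipping setsid on Windows, could look roughly like the following Java sketch. This is not the code from HADOOP-5194.patch: the class and method names are hypothetical, and detecting Windows from the os.name system property is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that decides whether a task launch command
// should be wrapped in setsid.
final class TaskCommandSketch {

    /** Returns true when the given os.name value denotes Windows. */
    static boolean isWindows(String osName) {
        return osName != null && osName.startsWith("Windows");
    }

    /**
     * Builds the launch command. On non-Windows systems the command is
     * prefixed with setsid so the task runs in its own session; on Windows
     * the command is returned without the prefix.
     */
    static List<String> buildCommand(String osName, List<String> taskCommand) {
        List<String> command = new ArrayList<>();
        if (!isWindows(osName)) {
            command.add("setsid");
        }
        command.addAll(taskCommand);
        return command;
    }
}
```

A caller might pass buildCommand(System.getProperty("os.name"), taskCommand) to a java.lang.ProcessBuilder.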

        Vinod Kumar Vavilapalli added a comment -

        +1 for disabling it on WINDOWS as an interim fix.

        Tsz Wo Nicholas Sze added a comment -

        > I guess the issue you see on Hudson is solved by the patch for 4963. Would you please confirm?

        You are right. Hudson does not fail anymore.

        Ravi Gummadi added a comment -

        Attaching a patch that disables the use of setsid and pidfiles on Windows.

        Please review and provide your comments.

        Vinod Kumar Vavilapalli added a comment -

        Patch looks good overall to me. Two minor nits.

        • There are ^M characters in the patch.
        • The documentation of mapred.tasktracker.tasks.sleeptime-before-sigkill now reads "The time, in milliseconds, the tasktracker waits for sending a SIGKILL to a process, after it has been sent a SIGTERM. This is not used on WINDOWS now". Can you change it to something like "The time, in milliseconds, the tasktracker waits for sending a SIGKILL to a task, after it has been sent a SIGTERM. This is currently not used on WINDOWS where tasks are just sent a SIGTERM." ?

        Can you upload a patch fixing these?
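The documentation text quoted above describes a two-step shutdown: send SIGTERM, wait for the configured number of milliseconds, then send SIGKILL. A minimal Java sketch of that pattern follows. It is not the patch's code: TaskStopperSketch and stop are hypothetical names, and the sketch uses the standard java.lang.Process API (Java 8 or later) rather than signals. Whether destroy() terminates the process normally or forcibly is implementation dependent, so the mapping to SIGTERM and SIGKILL is only an analogy.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of "polite stop, wait, then forced stop".
final class TaskStopperSketch {

    /**
     * Asks the process to stop, waits up to sleepBeforeSigkillMs milliseconds,
     * and forcibly kills it if it is still running.
     *
     * @return true if the forced kill was needed, false if the process
     *         exited within the wait period
     */
    static boolean stop(Process process, long sleepBeforeSigkillMs) {
        process.destroy(); // polite request, comparable to SIGTERM
        try {
            if (process.waitFor(sleepBeforeSigkillMs, TimeUnit.MILLISECONDS)) {
                return false; // exited in time, no forced kill
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and fall through to the forced kill.
            Thread.currentThread().interrupt();
        }
        process.destroyForcibly(); // forced stop, comparable to SIGKILL
        return true;
    }
}
```

The wait period plays the role of mapred.tasktracker.tasks.sleeptime-before-sigkill in this sketch.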

        Ravi Gummadi added a comment -

        Vinod, attaching a patch with the changes.

        Ravi Gummadi added a comment -

        Unit tests passed on my Linux machine.

        The number of unit test failures on Windows dropped from 80 (with trunk) to 18 (with this patch). Here are the failures with this patch:

        org.apache.hadoop.hdfs.server.namenode.TestBackupNode.testCheckpoint
        org.apache.hadoop.fs.TestDFVariations.testOSParsing
        org.apache.hadoop.mapreduce.TestMapReduceLazyOutput.testLazyOutput
        org.apache.hadoop.cli.TestCLI.testAll
        org.apache.hadoop.fs.TestTrash.testTrash
        org.apache.hadoop.fs.TestTrash.testNonDefaultFS
        org.apache.hadoop.hdfs.TestHDFSServerPorts.testNameNodePorts
        org.apache.hadoop.hdfs.TestHDFSServerPorts.testDataNodePorts
        org.apache.hadoop.hdfs.TestHDFSServerPorts.testSecondaryNodePorts
        org.apache.hadoop.hdfs.TestHDFSTrash.testNonDefaultFS
        org.apache.hadoop.mapred.TestJobInProgress.testRunningTaskCount
        org.apache.hadoop.mapred.TestJobQueueInformation.testJobQueues
        org.apache.hadoop.mapred.TestJobTrackerRestart.testJobTrackerRestart
        org.apache.hadoop.mapred.TestMRServerPorts.testJobTrackerPorts
        org.apache.hadoop.mapred.TestMRServerPorts.testTaskTrackerPorts
        org.apache.hadoop.mapred.TestMiniMRDFSCaching.testWithDFS
        org.apache.hadoop.mapred.TestMiniMRLocalFS.testWithLocal
        org.apache.hadoop.security.authorize.TestServiceLevelAuthorization.testRefresh

        The logs show that these failures are not related to setsid or pidfiles.

        Reliability test also passed.

        ant test-patch gave:

        [exec] +1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
        [exec]
        [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

        Vinod Kumar Vavilapalli added a comment -

        +1 for the patch from my side. Can the committer please review it and run dos2unix on the patch before committing? Thanks.

        Devaraj Das added a comment -

        I just committed this. Thanks, Ravi!

        Hudson added a comment -

        Integrated in Hadoop-trunk #796 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/796/)

        Robert Chansler added a comment -

        Editorial pass over all release notes prior to publication of 0.21.


          People

          • Assignee:
            Ravi Gummadi
            Reporter:
            Tsz Wo Nicholas Sze
          • Votes:
            0
            Watchers:
            2
