Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: util
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I have seen one of the task failures with following exception:
      java.lang.NullPointerException
      at java.lang.ProcessBuilder.start(ProcessBuilder.java:441)
      at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
      at org.apache.hadoop.util.Shell.run(Shell.java:134)
      at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
      at org.apache.hadoop.util.ProcessTree.isAlive(ProcessTree.java:244)
      at org.apache.hadoop.util.ProcessTree.sigKillInCurrentThread(ProcessTree.java:67)
      at org.apache.hadoop.util.ProcessTree.sigKill(ProcessTree.java:115)
      at org.apache.hadoop.util.ProcessTree.destroyProcessGroup(ProcessTree.java:164)
      at org.apache.hadoop.util.ProcessTree.destroy(ProcessTree.java:180)
      at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:377)
      at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:249)
      at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113)
      at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:76)
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:411)

      1. patch-5198.txt
        1 kB
        Amareshwari Sriramadasu
      2. patch-5198-1.txt
        2 kB
        Amareshwari Sriramadasu

        Activity

        Hide
        Amareshwari Sriramadasu added a comment -

        corresponding TaskTracker logs :

        2009-02-09 10:32:54,062 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200902090818_0001_m_009589_1 task's state:KILLED_UNCLEAN
        2009-02-09 10:32:54,062 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200902090818_0001_m_009589_1
        2009-02-09 10:32:54,062 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200902090818_0001_m_009589_1
        2009-02-09 10:32:54,281 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200902090818_0001_m_-1842952603
        2009-02-09 10:32:54,281 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_200902090818_0001_m_-1842952603 spawned.                                    
        2009-02-09 10:32:56,119 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200902090818_0001_m_-1842952603 given task: attempt_200902090818_0001_m_009589_1
        2009-02-09 10:32:57,031 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902090818_0001_m_009589_1 0.0%
        2009-02-09 10:32:57,048 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902090818_0001_m_009589_1 0.0% cleanup
        2009-02-09 10:32:57,050 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_200902090818_0001_m_009589_1 is done.
        2009-02-09 10:32:57,050 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_200902090818_0001_m_009589_1  was 0
        2009-02-09 10:32:57,052 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200902090818_0001_m_009589_1 done; removing files.
        2009-02-09 10:32:57,055 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1
        2009-02-09 10:32:57,070 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200902090818_0001_m_009910_1 task's state:UNASSIGNED
        2009-02-09 10:32:57,070 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200902090818_0001_m_009910_1
        2009-02-09 10:32:57,070 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200902090818_0001_m_009910_1
        2009-02-09 10:32:57,200 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM: jvm_200902090818_0001_m_-1842952603
        2009-02-09 10:32:57,251 WARN org.apache.hadoop.util.ProcessTree: Error executing shell command org.apache.hadoop.util.Shell$ExitCodeException: kill: can't find process "-null"
        2009-02-09 10:32:57,251 INFO org.apache.hadoop.util.ProcessTree: Killing all processes in the process group null with SIGTERM. Exit code 1
        2009-02-09 10:32:57,252 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200902090818_0001_m_-1842952603 exited. Number of tasks it ran: 1
        2009-02-09 10:32:57,256 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200902090818_0001_m_009910_1 Child Error
        java.lang.NullPointerException        
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:441)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)        
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)        
        at org.apache.hadoop.util.ProcessTree.isAlive(ProcessTree.java:244)
        at org.apache.hadoop.util.ProcessTree.sigKillInCurrentThread(ProcessTree.java:67)        
        at org.apache.hadoop.util.ProcessTree.sigKill(ProcessTree.java:115)
        at org.apache.hadoop.util.ProcessTree.destroyProcessGroup(ProcessTree.java:164)        
        at org.apache.hadoop.util.ProcessTree.destroy(ProcessTree.java:180)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:377)        
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:249)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113)        
        at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:76)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:411)                                                                                      
        
        Show
        Amareshwari Sriramadasu added a comment - corresponding TaskTracker logs : 2009-02-09 10:32:54,062 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200902090818_0001_m_009589_1 task's state:KILLED_UNCLEAN 2009-02-09 10:32:54,062 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200902090818_0001_m_009589_1 2009-02-09 10:32:54,062 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200902090818_0001_m_009589_1 2009-02-09 10:32:54,281 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_200902090818_0001_m_-1842952603 2009-02-09 10:32:54,281 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_200902090818_0001_m_-1842952603 spawned. 2009-02-09 10:32:56,119 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_200902090818_0001_m_-1842952603 given task: attempt_200902090818_0001_m_009589_1 2009-02-09 10:32:57,031 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902090818_0001_m_009589_1 0.0% 2009-02-09 10:32:57,048 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200902090818_0001_m_009589_1 0.0% cleanup 2009-02-09 10:32:57,050 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_200902090818_0001_m_009589_1 is done. 2009-02-09 10:32:57,050 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_200902090818_0001_m_009589_1 was 0 2009-02-09 10:32:57,052 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200902090818_0001_m_009589_1 done; removing files. 2009-02-09 10:32:57,055 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 1 2009-02-09 10:32:57,070 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200902090818_0001_m_009910_1 task's state:UNASSIGNED 2009-02-09 10:32:57,070 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200902090818_0001_m_009910_1 2009-02-09 10:32:57,070 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200902090818_0001_m_009910_1 2009-02-09 10:32:57,200 INFO org.apache.hadoop.mapred.JvmManager: Killing JVM: jvm_200902090818_0001_m_-1842952603 2009-02-09 10:32:57,251 WARN org.apache.hadoop.util.ProcessTree: Error executing shell command org.apache.hadoop.util.Shell$ExitCodeException: kill: can't find process "-null" 2009-02-09 10:32:57,251 INFO org.apache.hadoop.util.ProcessTree: Killing all processes in the process group null with SIGTERM. Exit code 1 2009-02-09 10:32:57,252 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200902090818_0001_m_-1842952603 exited. Number of tasks it ran: 1 2009-02-09 10:32:57,256 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200902090818_0001_m_009910_1 Child Error java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:441) at org.apache.hadoop.util.Shell.runCommand(Shell.java:149) at org.apache.hadoop.util.Shell.run(Shell.java:134) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286) at org.apache.hadoop.util.ProcessTree.isAlive(ProcessTree.java:244) at org.apache.hadoop.util.ProcessTree.sigKillInCurrentThread(ProcessTree.java:67) at org.apache.hadoop.util.ProcessTree.sigKill(ProcessTree.java:115) at org.apache.hadoop.util.ProcessTree.destroyProcessGroup(ProcessTree.java:164) at org.apache.hadoop.util.ProcessTree.destroy(ProcessTree.java:180) at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:377) at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:249) at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113) at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:76) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:411)
        Hide
        Ravi Gummadi added a comment -

        Looks like the issue is because of patch of JIRA-2721.
        Is the problem reproducible (Is the failure consistent) ? Would you please provide a testcase to reproduce the problem ?

        Show
        Ravi Gummadi added a comment - Looks like the issue is because of patch of JIRA-2721. Is the problem reproducible (Is the failure consistent) ? Would you please provide a testcase to reproduce the problem ?
        Hide
        Amareshwari Sriramadasu added a comment -

        Ravi, it is not easy to reproduce. Out of my 10000 maps, one of the tasks had the failure. From the logs, it looks like some race condition.

        Show
        Amareshwari Sriramadasu added a comment - Ravi, it is not easy to reproduce. Out of my 10000 maps, one of the tasks had the failure. From the logs, it looks like some race condition.
        Hide
        Vinod Kumar Vavilapalli added a comment -

        Even I observed two instances of this problem on a big cluster with a JT that was running for nearly 6 hours. A preliminary analysis does point to a race condition between a jvm just cleaning up and a new task trying to kill this old jvm for a free slot.

        Show
        Vinod Kumar Vavilapalli added a comment - Even I observed two instances of this problem on a big cluster with a JT that was running for nearly 6 hours. A preliminary analysis does point to a race condition between a jvm just cleaning up and a new task trying to kill this old jvm for a free slot.
        Hide
        Amareshwari Sriramadasu added a comment -

        Attaching patch with not-null check for pid, before passing it to kill-process.

        The NPE occurs when jvmIdToRunner map contains the runner and pid file is already cleanedup.
        The scenario: successful tasks (reducers or cleanup attempts) cleanup their files as early as possible and jvmIdToRunner entry is deleted in updateOnJvmExit. If a LaunchTaskAction comes before updateOnJvmExit call, reapJvm for new task still finds the jvmRunner and tries to kill it, thereby NPE for the pid.
        So, one solution is reapJvm need not kill the process when pid is null, because the task has reported done and jvm is on the way to exit already (in all the cases)

        Thoughts?

        Show
        Amareshwari Sriramadasu added a comment - Attaching patch with not-null check for pid, before passing it to kill-process. The NPE occurs when jvmIdToRunner map contains the runner and pid file is already cleanedup. The scenario: successful tasks (reducers or cleanup attempts) cleanup their files as early as possible and jvmIdToRunner entry is deleted in updateOnJvmExit. If a LaunchTaskAction comes before updateOnJvmExit call, reapJvm for new task still finds the jvmRunner and tries to kill it, thereby NPE for the pid. So, one solution is reapJvm need not kill the process when pid is null, because the task has reported done and jvm is on the way to exit already (in all the cases) Thoughts?
        Hide
        Amareshwari Sriramadasu added a comment -

        trying to run by hudson

        Show
        Amareshwari Sriramadasu added a comment - trying to run by hudson
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12400159/patch-5198.txt
        against trunk revision 744406.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12400159/patch-5198.txt against trunk revision 744406. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3868/console This message is automatically generated.
        Hide
        Nigel Daley added a comment -

        Can pid become null between where you check for it and where you call ProcessTree.destroy(...)?

        Can you write a unit test? If not, can you describe the steps of the manual testing you did to ensure this patch works?

        Show
        Nigel Daley added a comment - Can pid become null between where you check for it and where you call ProcessTree.destroy(...)? Can you write a unit test? If not, can you describe the steps of the manual testing you did to ensure this patch works?
        Hide
        Amareshwari Sriramadasu added a comment -

        Can pid become null between where you check for it and where you call ProcessTree.destroy(...)?

        No. pid cannnot become null. Once obtained from the file, it is stored in a String and passed to the kill-process.

        Can you write a unit test? If not, can you describe the steps of the manual testing you did to ensure this patch works?

        This is almost all the time reproducible by running MRReliability test on single-node cluster and occasional failures on cluster. I didnt see any failure after the patch.

        Show
        Amareshwari Sriramadasu added a comment - Can pid become null between where you check for it and where you call ProcessTree.destroy(...)? No. pid cannnot become null. Once obtained from the file, it is stored in a String and passed to the kill-process. Can you write a unit test? If not, can you describe the steps of the manual testing you did to ensure this patch works? This is almost all the time reproducible by running MRReliability test on single-node cluster and occasional failures on cluster. I didnt see any failure after the patch.
        Hide
        Ravi Gummadi added a comment -

        Patch looks good.

        +1.

        Show
        Ravi Gummadi added a comment - Patch looks good. +1.
        Hide
        Amareshwari Sriramadasu added a comment -

        Patch updated with trunk

        Show
        Amareshwari Sriramadasu added a comment - Patch updated with trunk
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12404112/patch-5198-1.txt
        against trunk revision 759398.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12404112/patch-5198-1.txt against trunk revision 759398. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/78/console This message is automatically generated.
        Hide
        Devaraj Das added a comment -

        I just committed this. Thanks, Amareshwari!

        Show
        Devaraj Das added a comment - I just committed this. Thanks, Amareshwari!
        Hide
        Hudson added a comment -
        Show
        Hudson added a comment - Integrated in Hadoop-trunk #796 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/796/ )

          People

          • Assignee:
            Amareshwari Sriramadasu
            Reporter:
            Amareshwari Sriramadasu
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development