Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-5059

'whoami', 'topologyscript' calls failing with java.io.IOException: error=12, Cannot allocate memory

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: util
    • Labels:
      None
    • Environment:

      On nodes with
      physical memory 32G
      Swap 16G

      Primary/Secondary Namenode using 25G of heap or more

      Description

      We've seen primary/secondary namenodes fail when calling whoami or topologyscripts.
      (Discussed as part of HADOOP-4998)

      Sample stack traces.

      Primary Namenode

      2009-01-12 03:57:27,381 WARN org.apache.hadoop.net.ScriptBasedMapping: java.io.IOException: Cannot run program
      "/path/topologyProgram" (in directory "/path"):
      java.io.IOException: error=12, Cannot allocate memory
              at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
              at org.apache.hadoop.util.Shell.run(Shell.java:134)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
              at org.apache.hadoop.net.ScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:122)
              at org.apache.hadoop.net.ScriptBasedMapping.resolve(ScriptBasedMapping.java:73)
              at org.apache.hadoop.dfs.FSNamesystem$ResolutionMonitor.run(FSNamesystem.java:1869)
              at java.lang.Thread.run(Thread.java:619)
      Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
              at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
              at java.lang.ProcessImpl.start(ProcessImpl.java:65)
              at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
              ... 7 more
      
      2009-01-12 03:57:27,381 ERROR org.apache.hadoop.fs.FSNamesystem: The resolve call returned null! Using /default-rack
      for some hosts
      2009-01-12 03:57:27,381 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/55.5.55.55:50010
      
      

      Secondary Namenode

      2008-10-09 02:00:58,288 ERROR org.apache.hadoop.dfs.NameNode.Secondary: java.io.IOException:
      javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": java.io.IOException:
      error=12, Cannot allocate memory
              at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
              at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
              at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
              at org.apache.hadoop.dfs.FSNamesystem.setConfigurationParameters(FSNamesystem.java:370)
              at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:359)
              at org.apache.hadoop.dfs.SecondaryNameNode.doMerge(SecondaryNameNode.java:340)
              at org.apache.hadoop.dfs.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:312)
              at org.apache.hadoop.dfs.SecondaryNameNode.run(SecondaryNameNode.java:223)
              at java.lang.Thread.run(Thread.java:619)
      
              at org.apache.hadoop.dfs.FSNamesystem.setConfigurationParameters(FSNamesystem.java:372)
              at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:359)
              at org.apache.hadoop.dfs.SecondaryNameNode.doMerge(SecondaryNameNode.java:340)
              at org.apache.hadoop.dfs.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:312)
              at org.apache.hadoop.dfs.SecondaryNameNode.run(SecondaryNameNode.java:223)
              at java.lang.Thread.run(Thread.java:619)
      
      

        Issue Links

          Activity

          Hide
          aw Allen Wittenauer added a comment -

          I'm going to close this as fixed.

          A bunch of things have happened:

          a) On certain platforms, java now uses posix_spawn() instead of fork().

          b) Topology can now be provided by a class.

          c) The whoami call has been removed.

          So there are definitely ways to mitigate/eliminate this issue.

          Show
          aw Allen Wittenauer added a comment - I'm going to close this as fixed. A bunch of things have happened: a) On certain platforms, java now uses posix_spawn() instead of fork(). b) Topology can now be provided by a class. c) The whoami call has been removed. So there are definitely ways to mitigate/eliminate this issue.
          Hide
          jukkaz Jukka Zitting added a comment -

          See http://developers.sun.com/solaris/articles/subprocess/subprocess.html for relevant documentation. I'm working on a similar problem in TIKA-591and JCR-2864. Perhaps we could pool efforts for solving this in somewhere like Commons Exec?

          Show
          jukkaz Jukka Zitting added a comment - See http://developers.sun.com/solaris/articles/subprocess/subprocess.html for relevant documentation. I'm working on a similar problem in TIKA-591 and JCR-2864 . Perhaps we could pool efforts for solving this in somewhere like Commons Exec?
          Hide
          steve_l Steve Loughran added a comment -

          A daemon would work, given that Hadoop RPC could be used as the protocol for talking to it, the alternative being (usually) something serialized over stdin and stdout. There are some serious security risks associated with having an OS-services daemon listening on a network port. By decoupling it you would even be able to deal with memory leak issues in any embedded libraries, just restart the daemon every few hours.

          Show
          steve_l Steve Loughran added a comment - A daemon would work, given that Hadoop RPC could be used as the protocol for talking to it, the alternative being (usually) something serialized over stdin and stdout. There are some serious security risks associated with having an OS-services daemon listening on a network port. By decoupling it you would even be able to deal with memory leak issues in any embedded libraries, just restart the daemon every few hours.
          Hide
          cutting Doug Cutting added a comment -

          > why is it a bad idea for Java to use vfork()?

          vfork() is very fragile, since, until a call to exec is made, the new process runs in the same memory as its parent, including the stack, etc. The parent process is also suspended until exec() is called, but, still, the child can easily wreak havoc.

          https://www.securecoding.cert.org/confluence/display/seccode/POS33-C.+Do+not+use+vfork()

          That said, it seems like folks do still use vfork() to get around this problem, e.g.:

          http://bugs.sun.com/view_bug.do?bug_id=5049299
          http://sources.redhat.com/ml/glibc-bugs/2004-09/msg00045.html

          Show
          cutting Doug Cutting added a comment - > why is it a bad idea for Java to use vfork()? vfork() is very fragile, since, until a call to exec is made, the new process runs in the same memory as its parent, including the stack, etc. The parent process is also suspended until exec() is called, but, still, the child can easily wreak havoc. https://www.securecoding.cert.org/confluence/display/seccode/POS33-C.+Do+not+use+vfork( ) That said, it seems like folks do still use vfork() to get around this problem, e.g.: http://bugs.sun.com/view_bug.do?bug_id=5049299 http://sources.redhat.com/ml/glibc-bugs/2004-09/msg00045.html
          Hide
          rangadi Raghu Angadi added a comment -

          >... [from above links] It seems like Java is correct to use fork()+exec(), not vfork()+exec(). [...]

          just curious, why is it a bad idea for Java to use vfork()? The code on child process from return of vfork() till excecv() it is still in JVM's control.

          Show
          rangadi Raghu Angadi added a comment - >... [from above links] It seems like Java is correct to use fork()+exec(), not vfork()+exec(). [...] just curious, why is it a bad idea for Java to use vfork()? The code on child process from return of vfork() till excecv() it is still in JVM's control.
          Hide
          cutting Doug Cutting added a comment -

          > Can we have a mixture of the three?

          Yikes! The same feature implemented three different ways? Two is bad enough, since, when they don't operate identically, confusion can ensue. Three is worse. One option might be to always use a Java daemon, but have the daemon either run shell scripts or native code. The daemon's heap would stay small, so forks would stay cheap and it we'd have a scalable portable solution, but for higher performance and/or security the native code could be used.

          Show
          cutting Doug Cutting added a comment - > Can we have a mixture of the three? Yikes! The same feature implemented three different ways? Two is bad enough, since, when they don't operate identically, confusion can ensue. Three is worse. One option might be to always use a Java daemon, but have the daemon either run shell scripts or native code. The daemon's heap would stay small, so forks would stay cheap and it we'd have a scalable portable solution, but for higher performance and/or security the native code could be used.
          Hide
          devaraj Devaraj Das added a comment -

          Can we have a mixture of the three? We first try to load the jni lib. If we fail, then try to contact the daemon. If the daemon is not available, then we launch the process. Of course, if we succeed in loading the JNI lib, we are fine. However, for the external daemon case, we need to take care of a case where the daemon may go down anytime... If and when that happens we switch to the process launch model (if we couldn't load the jni earlier on startup).. Thoughts?

          Show
          devaraj Devaraj Das added a comment - Can we have a mixture of the three? We first try to load the jni lib. If we fail, then try to contact the daemon. If the daemon is not available, then we launch the process. Of course, if we succeed in loading the JNI lib, we are fine. However, for the external daemon case, we need to take care of a case where the daemon may go down anytime... If and when that happens we switch to the process launch model (if we couldn't load the jni earlier on startup).. Thoughts?
          Hide
          cutting Doug Cutting added a comment -

          Allen> Stop forking for things it should be doing via a native method [ ... ]

          If we wish to do this uniformly, we might adopt APR (http://apr.apache.org/) and Tomcat's libtcnative (JNI bindings for APR).

          Allen> in order to change the topology on the fly, we have to restart the namenode

          Couldn't we add an admin command that reloads the topology on demand?

          Koji> Having a larger heap before fork()/exec() does slow down the calls.

          I guess that's the cost of copying the page table. Sigh. So we should probably move to a model where we either:

          • 1. talk to a daemon
          • 2. cache the output of shell commands for long periods (i.e., topology, groups, etc.)
          • 3. implement these with native code.

          1 is fragile, since we have another daemon to manage.
          3 means we don't run well out of the box on non-linux (e.g., MacOS, Solaris, & Windows)
          2 may or may not be acceptable.

          Show
          cutting Doug Cutting added a comment - Allen> Stop forking for things it should be doing via a native method [ ... ] If we wish to do this uniformly, we might adopt APR ( http://apr.apache.org/ ) and Tomcat's libtcnative (JNI bindings for APR). Allen> in order to change the topology on the fly, we have to restart the namenode Couldn't we add an admin command that reloads the topology on demand? Koji> Having a larger heap before fork()/exec() does slow down the calls. I guess that's the cost of copying the page table. Sigh. So we should probably move to a model where we either: 1. talk to a daemon 2. cache the output of shell commands for long periods (i.e., topology, groups, etc.) 3. implement these with native code. 1 is fragile, since we have another daemon to manage. 3 means we don't run well out of the box on non-linux (e.g., MacOS, Solaris, & Windows) 2 may or may not be acceptable.
          Hide
          knoguchi Koji Noguchi added a comment -

          or have overcommit_memory=1 then I don't think there's any performance penalty to using fork().

          Having a larger heap before fork()/exec() does slow down the calls.

          Heap being used. Time for a single 'ls' call
          260000000 40 milliseconds
          2600000000 360 milliseconds
          5200000000 569 milliseconds
          7800000000 758 milliseconds
          10400000000 994 milliseconds
          13000000000 1186 milliseconds
          15600000000 1417 milliseconds
          18200000000 1564 milliseconds
          20800000000 1792 milliseconds
          23400000000 1910 milliseconds
          26000000000 2094 milliseconds
          Show
          knoguchi Koji Noguchi added a comment - or have overcommit_memory=1 then I don't think there's any performance penalty to using fork(). Having a larger heap before fork()/exec() does slow down the calls. Heap being used. Time for a single 'ls' call 260000000 40 milliseconds 2600000000 360 milliseconds 5200000000 569 milliseconds 7800000000 758 milliseconds 10400000000 994 milliseconds 13000000000 1186 milliseconds 15600000000 1417 milliseconds 18200000000 1564 milliseconds 20800000000 1792 milliseconds 23400000000 1910 milliseconds 26000000000 2094 milliseconds
          Hide
          aw Allen Wittenauer added a comment -

          That assumes you have an OS that supports overcommit. Most don't, and rightfully so, as it causes memory management from an operations perspective to be wildly unpredictable. Thus our issues.

          Hadoop needs to do two things:

          a) Move the topology program to be run as a completely separate daemon and open a socket to talk to it over the loopback interface. This trick has been used by squid (unlinkd) and many other applications quite effectively to offload all of the forking.

          b) Stop forking for things it should be doing via a native method rather than putting reliances upon external applications being in certain locations or, worse, using a completely untrusted path. The output of programs are not guaranteed to be stable, even in POSIX-land.

          As a sidenote to a, Owen has mentioned moving the topology program to be a Java loadable class. AFAIK, this won't work in the real world, as it means that in order to change the topology on the fly, we have to restart the namenode. Or worse, you'll need to get your admin team to learn Java.

          Show
          aw Allen Wittenauer added a comment - That assumes you have an OS that supports overcommit. Most don't, and rightfully so, as it causes memory management from an operations perspective to be wildly unpredictable. Thus our issues. Hadoop needs to do two things: a) Move the topology program to be run as a completely separate daemon and open a socket to talk to it over the loopback interface. This trick has been used by squid (unlinkd) and many other applications quite effectively to offload all of the forking. b) Stop forking for things it should be doing via a native method rather than putting reliances upon external applications being in certain locations or, worse, using a completely untrusted path. The output of programs are not guaranteed to be stable, even in POSIX-land. As a sidenote to a, Owen has mentioned moving the topology program to be a Java loadable class. AFAIK, this won't work in the real world, as it means that in order to change the topology on the fly, we have to restart the namenode. Or worse, you'll need to get your admin team to learn Java.
          Hide
          cutting Doug Cutting added a comment -

          Based on:

          http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

          It sounds like maybe the safer thing to do is to increase swap to equal RAM and set overcommit_memory=2.

          Show
          cutting Doug Cutting added a comment - Based on: http://www.win.tue.nl/~aeb/linux/lk/lk-9.html It sounds like maybe the safer thing to do is to increase swap to equal RAM and set overcommit_memory=2.
          Hide
          cutting Doug Cutting added a comment -

          Based on the descriptions here:

          http://lists.uclibc.org/pipermail/busybox/2005-December/017513.html

          and here:

          http://www.unixguide.net/unix/programming/1.1.2.shtml

          It seems like Java is correct to use fork()+exec(), not vfork()+exec(). But that with really big processes, if your swap space isn't huge and you don't have overcommit_memory=1, you'll inevitably see these problems when you fork. The standard workaround seems to be to keep a subprocess around and re-use it, which has its own set of problems.

          If you have either lots of swap space configured or have overcommit_memory=1overcommit_memory=1 then I don't think there's any performance penalty to using fork(). The new process has a huge address space that's nearly entirely shared with its parent for a short time, then it quickly shrinks down once the command is exec'd to something tiny, so it's harmless. So these solutions (increased swap or overcommit_memory=1) seem reasonable to me.

          At root this seems like a bug in Linux, that you cannot spawn a new subprocess without temporarily using as much address space as the parent process, but it does not seem like a bug that's likely to be fixed soon.

          Does this analysis sound right to others?

          Show
          cutting Doug Cutting added a comment - Based on the descriptions here: http://lists.uclibc.org/pipermail/busybox/2005-December/017513.html and here: http://www.unixguide.net/unix/programming/1.1.2.shtml It seems like Java is correct to use fork()+exec(), not vfork()+exec(). But that with really big processes, if your swap space isn't huge and you don't have overcommit_memory=1, you'll inevitably see these problems when you fork. The standard workaround seems to be to keep a subprocess around and re-use it, which has its own set of problems. If you have either lots of swap space configured or have overcommit_memory=1overcommit_memory=1 then I don't think there's any performance penalty to using fork(). The new process has a huge address space that's nearly entirely shared with its parent for a short time, then it quickly shrinks down once the command is exec'd to something tiny, so it's harmless. So these solutions (increased swap or overcommit_memory=1) seem reasonable to me. At root this seems like a bug in Linux, that you cannot spawn a new subprocess without temporarily using as much address space as the parent process, but it does not seem like a bug that's likely to be fixed soon. Does this analysis sound right to others?
          Hide
          knoguchi Koji Noguchi added a comment -

          Checking with strace, it was failing on

          clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4133c9f0) = -1
          ENOMEM (Cannot allocate memory)
          ...
          write(2, "Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory",
          85Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory) = 85
          
          

          This clone call didn't have CLONE_VM flag.
          From clone manpage,
          "If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process
          at the time of clone. Memory writes or file mappings/unmappings performed by one of the processes do not affect the
          other, as with fork(2). "

          So it's probably using fork() and not vfork().

          Show
          knoguchi Koji Noguchi added a comment - Checking with strace, it was failing on clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4133c9f0) = -1 ENOMEM (Cannot allocate memory) ... write(2, "Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory", 85Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory) = 85 This clone call didn't have CLONE_VM flag. From clone manpage, "If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process at the time of clone. Memory writes or file mappings/unmappings performed by one of the processes do not affect the other, as with fork(2). " So it's probably using fork() and not vfork().
          Hide
          knoguchi Koji Noguchi added a comment -

          Wrote a simple test.
          On node with physical memory of 32G and swap of 16G (we didn't bother to increase the swap when we added a memory),

          top - 19:46:19 up 109 days, 5:02, 1 user, load average: 0.37, 0.16, 0.09
          Tasks: 188 total, 1 running, 187 sleeping, 0 stopped, 0 zombie
          Cpu(s): 0.3% us, 0.1% sy, 0.2% ni, 99.3% id, 0.1% wa, 0.0% hi, 0.0% si
          Mem: 32895200k total, 5766028k used, 27129172k free, 809300k buffers
          Swap: 16386160k total, 93612k used, 16292548k free, 4352600k cached

          bash-3.00$ cat /proc/sys/vm/overcommit_memory
          0
          bash-3.00$ /grid/0/java/jdk1.6.0_06_x64/bin/java -Xmx28000m -XX:NewSize=1G -XX:MaxNewSize=1G TestSysCall 260000000 100
          Allocating 26000000000Bytes
          Allocating done successfully
          Calling ls ===================
          Exception in thread "main" java.io.IOException: Cannot run program "ls": java.io.IOException: error=12, Cannot allocate memory
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
          at TestSysCall.main(TestSysCall.java:21)
          Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
          at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
          at java.lang.ProcessImpl.start(ProcessImpl.java:65)
          at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
          ... 1 more
          (root)

          1. echo 1 > /proc/sys/vm/overcommit_memory
            exit

          bash-3.00$ cat /proc/sys/vm/overcommit_memory
          1
          bash-3.00$ /grid/0/java/jdk1.6.0_06_x64/bin/java -Xmx28000m -XX:NewSize=1G -XX:MaxNewSize=1G TestSysCall 260000000 100
          Allocating 26000000000Bytes
          Allocating done successfully
          Calling ls ===================
          TestSysCall.class
          TestSysCall.java
          hsperfdata_knoguchi
          reip_local
          ls done ===================
          ls has taken 2156 milliseconds

          Show
          knoguchi Koji Noguchi added a comment - Wrote a simple test. On node with physical memory of 32G and swap of 16G (we didn't bother to increase the swap when we added a memory), top - 19:46:19 up 109 days, 5:02, 1 user, load average: 0.37, 0.16, 0.09 Tasks: 188 total, 1 running, 187 sleeping, 0 stopped, 0 zombie Cpu(s): 0.3% us, 0.1% sy, 0.2% ni, 99.3% id, 0.1% wa, 0.0% hi, 0.0% si Mem: 32895200k total, 5766028k used, 27129172k free, 809300k buffers Swap: 16386160k total, 93612k used, 16292548k free, 4352600k cached bash-3.00$ cat /proc/sys/vm/overcommit_memory 0 bash-3.00$ /grid/0/java/jdk1.6.0_06_x64/bin/java -Xmx28000m -XX:NewSize=1G -XX:MaxNewSize=1G TestSysCall 260000000 100 Allocating 26000000000Bytes Allocating done successfully Calling ls =================== Exception in thread "main" java.io.IOException: Cannot run program "ls": java.io.IOException: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(ProcessBuilder.java:459) at TestSysCall.main(TestSysCall.java:21) Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory at java.lang.UNIXProcess.<init>(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:452) ... 1 more (root) echo 1 > /proc/sys/vm/overcommit_memory exit bash-3.00$ cat /proc/sys/vm/overcommit_memory 1 bash-3.00$ /grid/0/java/jdk1.6.0_06_x64/bin/java -Xmx28000m -XX:NewSize=1G -XX:MaxNewSize=1G TestSysCall 260000000 100 Allocating 26000000000Bytes Allocating done successfully Calling ls =================== TestSysCall.class TestSysCall.java hsperfdata_knoguchi reip_local ls done =================== ls has taken 2156 milliseconds
          Hide
          knoguchi Koji Noguchi added a comment -

          It's "java.io.IOException: error=12, Cannot allocate memory" but not OutOfMemoryException. Changing subject.

          Show
          knoguchi Koji Noguchi added a comment - It's "java.io.IOException: error=12, Cannot allocate memory" but not OutOfMemoryException. Changing subject.

            People

            • Assignee:
              Unassigned
              Reporter:
              knoguchi Koji Noguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              27 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development