Issue Details (XML | Word | Printable)

Key: HADOOP-5059
Type: Bug Bug
Status: Open Open
Priority: Major Major
Assignee: Unassigned
Reporter: Koji Noguchi
Votes: 0
Watchers: 15
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

'whoami', 'topologyscript' calls failing with java.io.IOException: error=12, Cannot allocate memory

Created: 15/Jan/09 07:36 PM   Updated: 09/Oct/09 05:57 PM
Return to search
Component/s: util
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File TestSysCall.java 2009-01-15 07:50 PM Koji Noguchi 1 kB
Environment:
On nodes with
physical memory 32G
Swap 16G

Primary/Secondary Namenode using 25G of heap or more

Issue Links:
Reference
 


 Description  « Hide
We've seen primary/secondary namenodes fail when calling whoami or topologyscripts.
(Discussed as part of HADOOP-4998)

Sample stack traces.

Primary Namenode

2009-01-12 03:57:27,381 WARN org.apache.hadoop.net.ScriptBasedMapping: java.io.IOException: Cannot run program
"/path/topologyProgram" (in directory "/path"):
java.io.IOException: error=12, Cannot allocate memory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
        at org.apache.hadoop.util.Shell.run(Shell.java:134)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
        at org.apache.hadoop.net.ScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:122)
        at org.apache.hadoop.net.ScriptBasedMapping.resolve(ScriptBasedMapping.java:73)
        at org.apache.hadoop.dfs.FSNamesystem$ResolutionMonitor.run(FSNamesystem.java:1869)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        ... 7 more

2009-01-12 03:57:27,381 ERROR org.apache.hadoop.fs.FSNamesystem: The resolve call returned null! Using /default-rack
for some hosts
2009-01-12 03:57:27,381 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/55.5.55.55:50010

Secondary Namenode

2008-10-09 02:00:58,288 ERROR org.apache.hadoop.dfs.NameNode.Secondary: java.io.IOException:
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": java.io.IOException:
error=12, Cannot allocate memory
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
        at org.apache.hadoop.dfs.FSNamesystem.setConfigurationParameters(FSNamesystem.java:370)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:359)
        at org.apache.hadoop.dfs.SecondaryNameNode.doMerge(SecondaryNameNode.java:340)
        at org.apache.hadoop.dfs.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:312)
        at org.apache.hadoop.dfs.SecondaryNameNode.run(SecondaryNameNode.java:223)
        at java.lang.Thread.run(Thread.java:619)

        at org.apache.hadoop.dfs.FSNamesystem.setConfigurationParameters(FSNamesystem.java:372)
        at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:359)
        at org.apache.hadoop.dfs.SecondaryNameNode.doMerge(SecondaryNameNode.java:340)
        at org.apache.hadoop.dfs.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:312)
        at org.apache.hadoop.dfs.SecondaryNameNode.run(SecondaryNameNode.java:223)
        at java.lang.Thread.run(Thread.java:619)



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Koji Noguchi added a comment - 15/Jan/09 07:37 PM
It's "java.io.IOException: error=12, Cannot allocate memory" but not OutOfMemoryException. Changing subject.

Koji Noguchi added a comment - 15/Jan/09 07:50 PM
Wrote a simple test.
On node with physical memory of 32G and swap of 16G (we didn't bother to increase the swap when we added a memory),

top - 19:46:19 up 109 days, 5:02, 1 user, load average: 0.37, 0.16, 0.09
Tasks: 188 total, 1 running, 187 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3% us, 0.1% sy, 0.2% ni, 99.3% id, 0.1% wa, 0.0% hi, 0.0% si
Mem: 32895200k total, 5766028k used, 27129172k free, 809300k buffers
Swap: 16386160k total, 93612k used, 16292548k free, 4352600k cached

bash-3.00$ cat /proc/sys/vm/overcommit_memory
0
bash-3.00$ /grid/0/java/jdk1.6.0_06_x64/bin/java -Xmx28000m -XX:NewSize=1G -XX:MaxNewSize=1G TestSysCall 260000000 100
Allocating 26000000000Bytes
Allocating done successfully
Calling ls ===================
Exception in thread "main" java.io.IOException: Cannot run program "ls": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at TestSysCall.main(TestSysCall.java:21)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 1 more
(root)

  1. echo 1 > /proc/sys/vm/overcommit_memory
    exit

bash-3.00$ cat /proc/sys/vm/overcommit_memory
1
bash-3.00$ /grid/0/java/jdk1.6.0_06_x64/bin/java -Xmx28000m -XX:NewSize=1G -XX:MaxNewSize=1G TestSysCall 260000000 100
Allocating 26000000000Bytes
Allocating done successfully
Calling ls ===================
TestSysCall.class
TestSysCall.java
hsperfdata_knoguchi
reip_local
ls done ===================
ls has taken 2156 milliseconds


Koji Noguchi added a comment - 15/Jan/09 07:55 PM
Checking with strace, it was failing on
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4133c9f0) = -1
ENOMEM (Cannot allocate memory)
...
write(2, "Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory",
85Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory) = 85

This clone call didn't have CLONE_VM flag.
From clone manpage,
"If CLONE_VM is not set, the child process runs in a separate copy of the memory space of the calling process
at the time of clone. Memory writes or file mappings/unmappings performed by one of the processes do not affect the
other, as with fork(2). "

So it's probably using fork() and not vfork().


Doug Cutting added a comment - 15/Jan/09 09:22 PM
Based on the descriptions here:

http://lists.uclibc.org/pipermail/busybox/2005-December/017513.html

and here:

http://www.unixguide.net/unix/programming/1.1.2.shtml

It seems like Java is correct to use fork()+exec(), not vfork()+exec(). But that with really big processes, if your swap space isn't huge and you don't have overcommit_memory=1, you'll inevitably see these problems when you fork. The standard workaround seems to be to keep a subprocess around and re-use it, which has its own set of problems.

If you have either lots of swap space configured or have overcommit_memory=1overcommit_memory=1 then I don't think there's any performance penalty to using fork(). The new process has a huge address space that's nearly entirely shared with its parent for a short time, then it quickly shrinks down once the command is exec'd to something tiny, so it's harmless. So these solutions (increased swap or overcommit_memory=1) seem reasonable to me.

At root this seems like a bug in Linux, that you cannot spawn a new subprocess without temporarily using as much address space as the parent process, but it does not seem like a bug that's likely to be fixed soon.

Does this analysis sound right to others?


Doug Cutting added a comment - 15/Jan/09 09:33 PM
Based on:

http://www.win.tue.nl/~aeb/linux/lk/lk-9.html

It sounds like maybe the safer thing to do is to increase swap to equal RAM and set overcommit_memory=2.


Allen Wittenauer added a comment - 15/Jan/09 11:34 PM
That assumes you have an OS that supports overcommit. Most don't, and rightfully so, as it causes memory management from an operations perspective to be wildly unpredictable. Thus our issues.

Hadoop needs to do two things:

a) Move the topology program to be run as a completely separate daemon and open a socket to talk to it over the loopback interface. This trick has been used by squid (unlinkd) and many other applications quite effectively to offload all of the forking.

b) Stop forking for things it should be doing via a native method rather than putting reliances upon external applications being in certain locations or, worse, using a completely untrusted path. The output of programs are not guaranteed to be stable, even in POSIX-land.

As a sidenote to a, Owen has mentioned moving the topology program to be a Java loadable class. AFAIK, this won't work in the real world, as it means that in order to change the topology on the fly, we have to restart the namenode. Or worse, you'll need to get your admin team to learn Java.


Koji Noguchi added a comment - 16/Jan/09 12:06 AM

or have overcommit_memory=1 then I don't think there's any performance penalty to using fork().

Having a larger heap before fork()/exec() does slow down the calls.

Heap being used. Time for a single 'ls' call
260000000 40 milliseconds
2600000000 360 milliseconds
5200000000 569 milliseconds
7800000000 758 milliseconds
10400000000 994 milliseconds
13000000000 1186 milliseconds
15600000000 1417 milliseconds
18200000000 1564 milliseconds
20800000000 1792 milliseconds
23400000000 1910 milliseconds
26000000000 2094 milliseconds

Doug Cutting added a comment - 16/Jan/09 06:38 PM
Allen> Stop forking for things it should be doing via a native method [ ... ]

If we wish to do this uniformly, we might adopt APR (http://apr.apache.org/) and Tomcat's libtcnative (JNI bindings for APR).

Allen> in order to change the topology on the fly, we have to restart the namenode

Couldn't we add an admin command that reloads the topology on demand?

Koji> Having a larger heap before fork()/exec() does slow down the calls.

I guess that's the cost of copying the page table. Sigh. So we should probably move to a model where we either:

  • 1. talk to a daemon
  • 2. cache the output of shell commands for long periods (i.e., topology, groups, etc.)
  • 3. implement these with native code.

1 is fragile, since we have another daemon to manage.
3 means we don't run well out of the box on non-linux (e.g., MacOS, Solaris, & Windows)
2 may or may not be acceptable.


Devaraj Das added a comment - 16/Jan/09 07:07 PM
Can we have a mixture of the three? We first try to load the jni lib. If we fail, then try to contact the daemon. If the daemon is not available, then we launch the process. Of course, if we succeed in loading the JNI lib, we are fine. However, for the external daemon case, we need to take care of a case where the daemon may go down anytime... If and when that happens we switch to the process launch model (if we couldn't load the jni earlier on startup).. Thoughts?

Doug Cutting added a comment - 16/Jan/09 07:17 PM
> Can we have a mixture of the three?

Yikes! The same feature implemented three different ways? Two is bad enough, since, when they don't operate identically, confusion can ensue. Three is worse. One option might be to always use a Java daemon, but have the daemon either run shell scripts or native code. The daemon's heap would stay small, so forks would stay cheap and it we'd have a scalable portable solution, but for higher performance and/or security the native code could be used.


Raghu Angadi added a comment - 16/Jan/09 11:24 PM
>... [from above links] It seems like Java is correct to use fork()+exec(), not vfork()+exec(). [...]

just curious, why is it a bad idea for Java to use vfork()? The code on child process from return of vfork() till excecv() it is still in JVM's control.


Doug Cutting added a comment - 17/Jan/09 12:16 AM
> why is it a bad idea for Java to use vfork()?

vfork() is very fragile, since, until a call to exec is made, the new process runs in the same memory as its parent, including the stack, etc. The parent process is also suspended until exec() is called, but, still, the child can easily wreak havoc.

https://www.securecoding.cert.org/confluence/display/seccode/POS33-C.+Do+not+use+vfork()

That said, it seems like folks do still use vfork() to get around this problem, e.g.:

http://bugs.sun.com/view_bug.do?bug_id=5049299
http://sources.redhat.com/ml/glibc-bugs/2004-09/msg00045.html


Steve Loughran added a comment - 19/Jan/09 11:23 AM
A daemon would work, given that Hadoop RPC could be used as the protocol for talking to it, the alternative being (usually) something serialized over stdin and stdout. There are some serious security risks associated with having an OS-services daemon listening on a network port. By decoupling it you would even be able to deal with memory leak issues in any embedded libraries, just restart the daemon every few hours.