|
[
Permlink
| « Hide
]
Koji Noguchi added a comment - 15/Jan/09 07:37 PM
It's "java.io.IOException: error=12, Cannot allocate memory" but not OutOfMemoryException. Changing subject.
Wrote a simple test.
On node with physical memory of 32G and swap of 16G (we didn't bother to increase the swap when we added a memory), top - 19:46:19 up 109 days, 5:02, 1 user, load average: 0.37, 0.16, 0.09 bash-3.00$ cat /proc/sys/vm/overcommit_memory
bash-3.00$ cat /proc/sys/vm/overcommit_memory Checking with strace, it was failing on
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4133c9f0) = -1 ENOMEM (Cannot allocate memory) ... write(2, "Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory", 85Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory) = 85 This clone call didn't have CLONE_VM flag. So it's probably using fork() and not vfork(). Based on the descriptions here:
http://lists.uclibc.org/pipermail/busybox/2005-December/017513.html and here: http://www.unixguide.net/unix/programming/1.1.2.shtml It seems like Java is correct to use fork()+exec(), not vfork()+exec(). But that with really big processes, if your swap space isn't huge and you don't have overcommit_memory=1, you'll inevitably see these problems when you fork. The standard workaround seems to be to keep a subprocess around and re-use it, which has its own set of problems. If you have either lots of swap space configured or have overcommit_memory=1overcommit_memory=1 then I don't think there's any performance penalty to using fork(). The new process has a huge address space that's nearly entirely shared with its parent for a short time, then it quickly shrinks down once the command is exec'd to something tiny, so it's harmless. So these solutions (increased swap or overcommit_memory=1) seem reasonable to me. At root this seems like a bug in Linux, that you cannot spawn a new subprocess without temporarily using as much address space as the parent process, but it does not seem like a bug that's likely to be fixed soon. Does this analysis sound right to others? Based on:
http://www.win.tue.nl/~aeb/linux/lk/lk-9.html It sounds like maybe the safer thing to do is to increase swap to equal RAM and set overcommit_memory=2. That assumes you have an OS that supports overcommit. Most don't, and rightfully so, as it causes memory management from an operations perspective to be wildly unpredictable. Thus our issues.
Hadoop needs to do two things: a) Move the topology program to be run as a completely separate daemon and open a socket to talk to it over the loopback interface. This trick has been used by squid (unlinkd) and many other applications quite effectively to offload all of the forking. b) Stop forking for things it should be doing via a native method rather than putting reliances upon external applications being in certain locations or, worse, using a completely untrusted path. The output of programs are not guaranteed to be stable, even in POSIX-land. As a sidenote to a, Owen has mentioned moving the topology program to be a Java loadable class. AFAIK, this won't work in the real world, as it means that in order to change the topology on the fly, we have to restart the namenode. Or worse, you'll need to get your admin team to learn Java.
Having a larger heap before fork()/exec() does slow down the calls.
Allen> Stop forking for things it should be doing via a native method [ ... ]
If we wish to do this uniformly, we might adopt APR (http://apr.apache.org/ Allen> in order to change the topology on the fly, we have to restart the namenode Couldn't we add an admin command that reloads the topology on demand? Koji> Having a larger heap before fork()/exec() does slow down the calls. I guess that's the cost of copying the page table. Sigh. So we should probably move to a model where we either:
1 is fragile, since we have another daemon to manage. Can we have a mixture of the three? We first try to load the jni lib. If we fail, then try to contact the daemon. If the daemon is not available, then we launch the process. Of course, if we succeed in loading the JNI lib, we are fine. However, for the external daemon case, we need to take care of a case where the daemon may go down anytime... If and when that happens we switch to the process launch model (if we couldn't load the jni earlier on startup).. Thoughts?
> Can we have a mixture of the three?
Yikes! The same feature implemented three different ways? Two is bad enough, since, when they don't operate identically, confusion can ensue. Three is worse. One option might be to always use a Java daemon, but have the daemon either run shell scripts or native code. The daemon's heap would stay small, so forks would stay cheap and it we'd have a scalable portable solution, but for higher performance and/or security the native code could be used. >... [from above links] It seems like Java is correct to use fork()+exec(), not vfork()+exec(). [...]
just curious, why is it a bad idea for Java to use vfork()? The code on child process from return of vfork() till excecv() it is still in JVM's control. > why is it a bad idea for Java to use vfork()?
vfork() is very fragile, since, until a call to exec is made, the new process runs in the same memory as its parent, including the stack, etc. The parent process is also suspended until exec() is called, but, still, the child can easily wreak havoc. https://www.securecoding.cert.org/confluence/display/seccode/POS33-C.+Do+not+use+vfork( That said, it seems like folks do still use vfork() to get around this problem, e.g.: http://bugs.sun.com/view_bug.do?bug_id=5049299 A daemon would work, given that Hadoop RPC could be used as the protocol for talking to it, the alternative being (usually) something serialized over stdin and stdout. There are some serious security risks associated with having an OS-services daemon listening on a network port. By decoupling it you would even be able to deal with memory leak issues in any embedded libraries, just restart the daemon every few hours.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||