|
In general, I like this approach.
> As part of this issue I would also like to provide two other TaskWrappers Yes, I think the initial patch should provide at least one other implementation, to prove the utility of the API. The thread-per-task approach has often been requested, and is a thus a great candidate. A "jail" implemented with 'chroot' that isolates users would also be very useful. If a new root directory is created per user then we should not need more than one additional uid. The tasktracker's uid would need sudo privledges in order to run 'chroot', so we would want to run user tasks as a different uid, but all user tasks could run as the same uid, but each with a different root filesystem. However such a capability might better be added in a separate issue... > We might consider deprecating "mapred.child.java.opts" If the TaskWrapper implementation is passed the Configuration, can't this property continue to be used by SeperateVMTaskWrapper? An in-VM task runner should always run the task in a new security manager (or at least try to set a new security manager, and fail gracefully if it can't, in case someone else is running Hadoop under a new Security Manager already). The SM could block calls to System.exit()
The reason for doing this from the outset is if you leave it out, it sets up expectations that are hard to change later on. The rule should be in-VM == under a security manager. @Doug Cutting
Glad you like it ^^ I don't think that chroot alone are a good idea, because tasks inside a chroot can still kill or ptrace tasks outside of the chroot. So it would not protect users from each other. FreeBSD jails or Linux vservers would be perfect, but they are non-standard and vservers require to patch the kernel.
The attached script demonstrate the process. If there is shell guru available, I would really like his advices ^^.
You're right, it would be the best way to ensure compatibility. For now I will continue to use parameters set by setMaximumMemory(), addArg() and so on, in order to test the API. But the "release" version of SeperateVMTaskWrapper will directly use mapred.child.java.opts. @Steve Loughran
Good idea ! For some TaskWrapper we might be unhable to provide a true security (I think mainly to the ThreadWrapper that I'm writing, which is mainly intended to be used with the Streaming API), but the programmer should at least be protected against most obvious errors. Thanks for your comments ! I was looking at the patch. Unfortunately, this patch has gone stale. Could you pls regenerate the patch. Alternatively, pls let me know the trunk revision you generated the patch against. Also it seems to me that TaskProcess.java should have been added as part of the patch but the patch file seems to try to delete that (currently) non-existent file.
Thanks! I am looking at the JVM reuse issue particularly. I think that can be done independently of this jira. My idea is to still have the umbilical protocol for task executions. The difference with the current approach is that that JVM would not exit at the end of the current task execution but the TaskTracker.Child.main would have a loop inside. The crux of the loop would look something like:
while (true) { Task t = umbilical.getTask(); if (t != null) { t.run(umbilical); } } If JVM reuse is enabled, the tasktracker would have 1 JVM per job. The tasktracker would kill the JVM process when the corresponding job finishes. If it is already running too many JVMs, it would kill one based on LRU. Also, the JVM would kill itself whenever it encounters an exception (a running task on encountering an exception would result in the shutdown of the JVM). That would take care of problems such as a long running JVM running out of memory. The only thing is a task attempt might get penalized for no fault of the task (in the case where the JVM is leaking memory, for example) since that attempt is bound to fail. So this is something to watch out for but I am not too worried about it at this point of time. Overall, the above would probably benefit jobs with many short running tasks (like for maps which are short lived). I am thinking of submitting the patch with the above for Thoughts? One thing to consider here is resources used per task. For e.g. memory that is used by a task attempt should be freed to as much extent as possible before the next task is run. This way, any per task limits like those in
In the ant context In-VM execution of things like Javac and junit was always good for speed, but bad for long term memory consumption, both of the normal heap and of PermGenHeapSpace
Here is some of the code, which is intended to give a hint as to how hard it is, rather than provide a starting point: http://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/util/JavaEnvUtils.java?view=markup Unless you are planning on killing the tasktracker process on a regular basis and restarting it, a more reliable alternative might be for the task tracker to start and communicate with a secondary process that the tasktracker can keep alive, but which lets you start even more work. This is effectively what smartfrog does, using RMI to communicate between processes. Again, classloaders are painful to work with, but you can kill the children more easily. It has proven to work better over long-lived deployments -but took a lot more engineering effort. Hemanth, yes, the memory should be freed up but in the cases where a task is using direct buffers and so on, which makes GC hard, one of the downstream tasks would suffer. But that's not the most typical use case. I hate to make the jvm reuse a per job user configurable option but if I do so, then users can turn this feature off if they see their job behaving erratically. Would that work?
Steve, actually, the task is run as a separate process. Its just that the same JVM would be reused for running more tasks of the same job (so classloader issue is a non-issue here). So the TT is not affected by this really. @Devaraj
If its just for tasks of the same job, then yes, life is simple: no classloader problems, and when the job is finished it will ultimately die. As long as the job's code doesnt create giant static data structures or do bad things with native libraries, you should be ok. Devaraj: This sounds like a fine design. One JVM is launched per job, so the per-job overhead should be much reduced, with threads in that job per task, right? This should ideally be configurable per-job (rather than per-cluster) and we could consider switching to this by default if performance is considerably faster for common jobs (e.g., sort).
Doug, I hadn't considered the thread-per-task approach. I was considering sequential executions of tasks (perhaps we would sometimes have more than one JVM for the same job in memory subject to the available free slots). The slots used in the thread-per-task case would be the number of concurrently running threads (read tasks) across all the JVMs, right?
It does bring in a complication to do with integration with The other complication is to figure out whether the framework code is clean enough (or threadsafe) that multiple instances of the Map/Reduce task can be active within one process at any given point of time. Ditto with the application code - can we assume that apps have been written to be thread safe. Deveraj, This approach looks good to me. One of the challenges is to handle the user logs, which are managed using the shell (since
The motivation for re-using the JVM is presumably performance. Since most nodes are multicore these days, it would be a shame if tasks of a job were executed serially on each node, no? So, if you don't intend to use threads, then I think you need to lift the one-jvm-per-job limit. The tasktracker could run instead one jvm per task slot, restarting them when tasks arrive from a different job. Could that work?
Brice, I note that you are using a script to setuid to one of a pool of users. My question is whether this patch implements enough to the su to the specified user. Doug, yes, as i had stated in my earlier comment, we might have more than one active JVM for a job. I was thinking of having the option to do with max JVMs in memory configurable where the default max could be equal to the #slots. But maybe it makes sense to just limit it to the number of slots.
Tom, you raise an interesting question on the details. Let me see how that would work. Everyone, pls continue discussion on
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
To provide compatibility, it tries to parse options from "mapred.child.java.opts" and to pass them to the TaskWrapper.
We might consider deprecating "mapred.child.java.opts" and spliting it into a few other options such as "mapred.child.java.properties" and "mapred.child.java.memoryLimit".