Hadoop Map/Reduce
MAPREDUCE-453

Provide more flexibility in the way tasks are run

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The aim
      With HADOOP-3421 discussing sharing a cluster among more than one organization (so potentially among non-cooperative users), and posts on the mailing list discussing virtualization and the ability to reuse the TaskTracker's VM to run new tasks, it could be useful to let admins choose the way TaskRunners run their children.

      More specifically, it could be useful to provide a way to imprison a Task in its working directory, or in a virtual machine.
      In some cases, reusing the VM might be useful, since it seems that this feature is really wanted (HADOOP-249).

      Concretely
      What I propose is a new class, called SeperateVMTaskWrapper, which contains the current logic for running tasks in another JVM. This class extends another, called TaskWrapper, which could be subclassed to provide new ways of running tasks.
      As part of this issue I would also like to provide two other TaskWrappers: the first would run the tasks as threads in the TaskRunner's VM (if that is possible without too many changes); the second would use a fixed pool of local Unix accounts to insulate tasks from each other (so potentially non-cooperating users will be able to share a cluster, as described in HADOOP-3421).
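To make the proposed hierarchy concrete, here is a hypothetical sketch of what the TaskWrapper API could look like. The class names follow this description (SeperateVMTaskWrapper spelled as in the issue), and setMaximumMemory()/addArg() are mentioned in a later comment; the accessors and field layout are illustrative guesses, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed hierarchy; the real patch may differ.
public abstract class TaskWrapper {
    private final List<String> args = new ArrayList<>();
    private long maxMemoryBytes = -1;   // -1 means "no limit requested"

    public void addArg(String arg) { args.add(arg); }
    public void setMaximumMemory(long bytes) { maxMemoryBytes = bytes; }
    public List<String> getArgs() { return args; }
    public long getMaximumMemory() { return maxMemoryBytes; }

    /** Launch the task; each subclass picks the isolation mechanism. */
    public abstract void run() throws Exception;
}

class SeperateVMTaskWrapper extends TaskWrapper {   // spelling as in the issue
    @Override
    public void run() {
        // Current behaviour: fork a child JVM for the task, e.g. via a
        // ProcessBuilder built from getArgs().
    }
}
```

A thread-based or Unix-account-based wrapper would then be just another subclass overriding run().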

      1. userBasedInsulator.sh
        2 kB
        Brice Arnould
      2. TaskWrapper_v0.patch
        25 kB
        Brice Arnould

        Issue Links

          Activity

          Brice Arnould added a comment -

          First implementation. It is really not ready for use in production, but it seems to work and should illustrate my aim.
          To provide compatibility, it tries to parse options from "mapred.child.java.opts" and to pass them to the TaskWrapper.
          We might consider deprecating "mapred.child.java.opts" and splitting it into a few other options such as "mapred.child.java.properties" and "mapred.child.java.memoryLimit".

          Doug Cutting added a comment -

          In general, I like this approach.

          > As part of this issue I would also like to provide two other TaskWrappers

          Yes, I think the initial patch should provide at least one other implementation, to prove the utility of the API. The thread-per-task approach has often been requested, and is thus a great candidate.

          A "jail" implemented with 'chroot' that isolates users would also be very useful. If a new root directory is created per user then we should not need more than one additional uid. The tasktracker's uid would need sudo privileges in order to run 'chroot', so we would want to run user tasks as a different uid; all user tasks could run as the same uid, each with a different root filesystem. However, such a capability might better be added in a separate issue...

          > We might consider deprecating "mapred.child.java.opts"

          If the TaskWrapper implementation is passed the Configuration, can't this property continue to be used by SeperateVMTaskWrapper?

          Steve Loughran added a comment -

          An in-VM task runner should always run the task in a new security manager (or at least try to set a new security manager, and fail gracefully if it can't, in case someone is already running Hadoop under a security manager). The SM could block calls to System.exit().

          The reason for doing this from the outset is that if you leave it out, it sets up expectations that are hard to change later on. The rule should be in-VM == under a security manager.
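A minimal sketch of that suggestion, with hypothetical names (nothing here is from a patch): a SecurityManager whose checkExit always vetoes, so an in-VM task cannot take down the shared JVM. Installing it globally can fail (or be restricted on modern JVMs), which is exactly the graceful-failure case Steve mentions, so the sketch only shows the manager itself.

```java
// Hypothetical sketch: a SecurityManager that blocks System.exit() from
// in-VM tasks. Everything else is left permissive for brevity.
public class TaskSecurityManager extends SecurityManager {
    @Override
    public void checkExit(int status) {
        // Refuse the exit; the wrapping runner decides the JVM's fate.
        throw new SecurityException("System.exit(" + status + ") blocked for in-VM task");
    }

    @Override
    public void checkPermission(java.security.Permission perm) {
        // Permissive for all other checks in this sketch.
    }

    /** True if checkExit vetoes an exit attempt (used for demonstration). */
    public static boolean blocksExit() {
        try {
            new TaskSecurityManager().checkExit(1);
            return false;   // unreachable: checkExit always throws
        } catch (SecurityException expected) {
            return true;
        }
    }
}
```

A real wrapper would try System.setSecurityManager(new TaskSecurityManager()) in a try/catch and fall back to separate-VM execution if installation is refused.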

          Brice Arnould added a comment -

          @Doug Cutting
          Glad you like it ^^

          I don't think that chroot alone is a good idea, because tasks inside a chroot can still kill or ptrace tasks outside of the chroot, so it would not protect users from each other. FreeBSD jails or Linux vservers would be perfect, but they are non-standard and vservers require patching the kernel.
          My intention is to provide a TaskWrapper that delegates the security to a user script, and to write a script (which could go in contrib/?) that would use a pool of local Unix accounts to run user tasks.
          So, if we have two users (Alice and Bob) whose tasks are going to be run on the same tasktracker, Alice's tasks will be run as the Unix user hadoop0 and Bob's tasks as hadoop1.
          When Alice's tasks are done, her files and processes are cleaned up atomically (via kill -PGROUP) to ensure there's nothing left. Then hadoop0 is made available for use by another Hadoop user.
          The benefit of using a separate shell script is that only this script (not the whole TaskTracker) needs root privileges. And it can get them via sudo (so we don't require yet another SUID binary).
          An administrator wanting to use this would:

          1. Deploy hadoop as usual
          2. Create Unix accounts hadoopUser0...hadoopUserN for use by this wrapper
          3. Add in /etc/sudoers a permission for the hadoop user to run the wrapper script as root
          4. Set the right wrapper in Hadoop config

          The attached script demonstrates the process. If there is a shell guru available, I would really like their advice ^^.
          We could also write a script that runs tasks inside a VM, but I'm unsure that it is useful, considering the overhead.
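As a hypothetical illustration of the delegation described above (the script path and account prefix are invented for this sketch, not taken from the attached userBasedInsulator.sh), a wrapper could build its child command line so that only the script runs under sudo while the TaskTracker keeps its unprivileged uid:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build the argv for running a task as one of the
// pooled Unix accounts, via a sudo-whitelisted wrapper script. The script
// path and "hadoopUser" prefix are illustrative assumptions.
public class InsulatedCommandBuilder {
    static final String WRAPPER = "/usr/local/hadoop/bin/userBasedInsulator.sh";

    /** Build the command line for running taskArgv as pooled account number `slot`. */
    public static List<String> buildCommand(int slot, List<String> taskArgv) {
        List<String> cmd = new ArrayList<>();
        cmd.add("sudo");                  // /etc/sudoers grants the hadoop user only this script
        cmd.add(WRAPPER);
        cmd.add("hadoopUser" + slot);     // account drawn from the fixed pool
        cmd.addAll(taskArgv);             // the real task command line
        return cmd;
    }
}
```

The resulting list would be handed to a ProcessBuilder; the script itself would drop to the pooled account and, on completion, kill the process group and clean up the account's files as described.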

          > If the TaskWrapper implementation is passed the Configuration, can't this property continue to be used by SeperateVMTaskWrapper?

          You're right, it would be the best way to ensure compatibility. For now I will continue to use parameters set by setMaximumMemory(), addArg() and so on, in order to test the API. But the "release" version of SeperateVMTaskWrapper will directly use mapred.child.java.opts.

          @Steve Loughran

          > An in-VM task runner should always run the task in a new security manager

          Good idea! For some TaskWrappers we might be unable to provide true security (I'm thinking mainly of the ThreadWrapper that I'm writing, which is mainly intended for use with the Streaming API), but the programmer should at least be protected against the most obvious errors.
          On the ML, Alejandro Abdelnur said that he already runs his tasks under a security manager; I'm going to ask him if he can publish more information that I could integrate into a TaskWrapper.
          You're also right about the fact that failing gracefully is very important, since some TaskWrappers might not be able to run all tasks. My next proposition will try to take that into account.

          Thanks for your comments!

          Devaraj Das added a comment -

          I was looking at the patch. Unfortunately, it has gone stale; could you pls regenerate it? Alternatively, pls let me know the trunk revision you generated the patch against. Also, it seems to me that TaskProcess.java should have been added as part of the patch, but the patch file seems to try to delete that (currently) non-existent file.
          Thanks!

          Devaraj Das added a comment -

          I am looking at the JVM reuse issue particularly. I think that can be done independently of this jira. My idea is to still have the umbilical protocol for task executions. The difference with the current approach is that the JVM would not exit at the end of the current task execution; instead, TaskTracker.Child.main would have a loop inside. The crux of the loop would look something like:

          while (true) {
            // Ask the TaskTracker for the next task over the umbilical protocol.
            Task t = umbilical.getTask();
            if (t != null) {
              t.run(umbilical);
            }
          }

          If JVM reuse is enabled, the tasktracker would have one JVM per job. The tasktracker would kill the JVM process when the corresponding job finishes. If it is already running too many JVMs, it would kill one based on LRU.

          Also, the JVM would kill itself whenever it encounters an exception (a running task encountering an exception would result in the shutdown of the JVM). That would take care of problems such as a long-running JVM running out of memory. The only thing is that a task attempt might get penalized for no fault of the task (in the case where the JVM is leaking memory, for example), since that attempt is bound to fail. So this is something to watch out for, but I am not too worried about it at this point.

          Overall, the above would probably benefit jobs with many short-running tasks (like maps, which are often short-lived).

          I am thinking of submitting the patch with the above for HADOOP-249.

          Thoughts?
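The "kill one based on LRU" policy above can be sketched with an access-ordered LinkedHashMap. This is purely illustrative (the class name, the jobId-to-pid mapping, and the kill stub are all assumptions, not from any patch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: cap the number of live per-job child JVMs, evicting
// (and killing) the least recently used one when the cap is exceeded.
public class JvmPool {
    private final int maxJvms;
    private final Map<String, Long> jvmsByJob;   // jobId -> pid of its child JVM

    public JvmPool(int maxJvms) {
        this.maxJvms = maxJvms;
        // accessOrder=true makes iteration order = least recently used first.
        this.jvmsByJob = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                if (size() > JvmPool.this.maxJvms) {
                    kill(eldest.getValue());     // evict the LRU job's JVM
                    return true;
                }
                return false;
            }
        };
    }

    /** Record that jobId's JVM (pid) just ran a task, registering it if new. */
    public void touch(String jobId, long pid) { jvmsByJob.put(jobId, pid); }

    public boolean hasJvmFor(String jobId) { return jvmsByJob.containsKey(jobId); }

    private void kill(long pid) {
        // A real tasktracker would signal the child process here.
    }
}
```

With a cap equal to the number of task slots, this matches Doug's later suggestion of one JVM per slot, restarted when tasks arrive from a different job.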

          Hemanth Yamijala added a comment -

          One thing to consider here is the resources used per task. For example, the memory used by a task attempt should be freed to as great an extent as possible before the next task is run. This way, any per-task limits like those in HADOOP-3581 can be enforced correctly.

          Steve Loughran added a comment -

          In the Ant context, in-VM execution of things like Javac and JUnit was always good for speed, but bad for long-term memory consumption, both of the normal heap and of PermGen heap space:

          • you can't unload a class until all references to it are removed
          • you can't unload a classloader until all classes in it are removed
          • it's very hard to be sure that classes/classloaders are fully unloaded.

          You also need to be sure that you don't pass down any of your own classes to the code, just those in the JDK, which forces you to add extra code and tests, especially for the com.sun stuff.

          Here is some of the code, which is intended to give a hint as to how hard it is, rather than provide a starting point:

          http://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/util/JavaEnvUtils.java?view=markup
          http://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/AntClassLoader.java?view=markup

          Unless you are planning on killing the tasktracker process on a regular basis and restarting it, a more reliable alternative might be for the tasktracker to start and communicate with a secondary process that the tasktracker can keep alive, but which lets you start even more work. This is effectively what SmartFrog does, using RMI to communicate between processes. Again, classloaders are painful to work with, but you can kill the children more easily. It has proven to work better over long-lived deployments, but took a lot more engineering effort.

          Devaraj Das added a comment -

          Hemanth, yes, the memory should be freed up, but in cases where a task is using direct buffers and so on, which makes GC hard, one of the downstream tasks would suffer. But that's not the most typical use case. I hate to make JVM reuse a per-job user-configurable option, but if I do so, then users can turn this feature off if they see their job behaving erratically. Would that work?
          Steve, actually, the task is run as a separate process. It's just that the same JVM would be reused for running more tasks of the same job (so the classloader issue is a non-issue here). So the TT is not really affected by this.

          Steve Loughran added a comment -

          @Devaraj

          If it's just for tasks of the same job, then yes, life is simple: no classloader problems, and when the job is finished it will ultimately die. As long as the job's code doesn't create giant static data structures or do bad things with native libraries, you should be OK.

          Doug Cutting added a comment -

          Devaraj: This sounds like a fine design. One JVM is launched per job, so the per-job overhead should be much reduced, with threads in that job per task, right? This should ideally be configurable per-job (rather than per-cluster) and we could consider switching to this by default if performance is considerably faster for common jobs (e.g., sort).

          Devaraj Das added a comment -

          Doug, I hadn't considered the thread-per-task approach. I was considering sequential executions of tasks (perhaps we would sometimes have more than one JVM for the same job in memory, subject to the available free slots). The slots used in the thread-per-task case would be the number of concurrently running threads (read: tasks) across all the JVMs, right?
          It does bring in a complication to do with integration with HADOOP-3581, but it should be possible to count how many task slots a JVM is currently using (the number of concurrently running tasks) and factor that into the resource utilization issue that HADOOP-3581 deals with.
          The other complication is to figure out whether the framework code is clean enough (or thread-safe) that multiple instances of the Map/Reduce task can be active within one process at any given point in time. Ditto with the application code: can we assume that apps have been written to be thread-safe?

          Tom White added a comment -

          Devaraj, this approach looks good to me. One of the challenges is handling the user logs, which are managed using the shell (since HADOOP-1553). If the JVM is reused for more tasks, how does the shell redirection know about this? If the JVM handled the logging (by redirecting System.out and System.err), then it would be fine, but the reason it was moved to the shell was performance (see the numbers in HADOOP-1553).

          Doug Cutting added a comment -

          The motivation for re-using the JVM is presumably performance. Since most nodes are multicore these days, it would be a shame if tasks of a job were executed serially on each node, no? So, if you don't intend to use threads, then I think you need to lift the one-jvm-per-job limit. The tasktracker could run instead one jvm per task slot, restarting them when tasks arrive from a different job. Could that work?

          Craig Macdonald added a comment -

          HADOOP-3280 and other places note that longer term the Hadoop task tracker will run as root and setuid to the user named in the JobConf.

          Brice, I note that you are using a script to setuid to one of a pool of users.

          My question is whether this patch implements enough to su to the specified user.

          Devaraj Das added a comment -

          Doug, yes, as I had stated in my earlier comment, we might have more than one active JVM for a job. I was thinking of making the maximum number of JVMs in memory configurable, where the default max could be equal to the number of slots. But maybe it makes sense to just limit it to the number of slots.
          Tom, you raise an interesting question on the details. Let me see how that would work.

          Devaraj Das added a comment -

          Everyone, pls continue discussion on HADOOP-249 on the specific topic of JVM reuse.

          Harsh J added a comment -

          HADOOP-3280 and MAPREDUCE-249 covered these ideas already. I am resolving this as a dupe of those alternatives; I do believe the planned features (minus the threads-per-task goal) are present in one form or another already today.

          For threads per task, if that is still a valid gain beyond what MR2's AM does, it can be discussed in a new issue.

          Harsh J added a comment -

          Sorry, the previous comment should have said HADOOP-249, not MAPREDUCE-249.

          Harsh J added a comment -

          And just for reference, MR2 details can be drilled into via MAPREDUCE-279.


            People

            • Assignee: Brice Arnould
            • Reporter: Brice Arnould
            • Votes: 0
            • Watchers: 16

              Dates

              • Created:
                Updated:
                Resolved:
