Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

Description

      It would be nice to be able to pause (and subsequently resume) tasks that are currently running, in order to allow tasks from higher priority jobs to execute. At present it is quite easy for long-running tasks from low priority jobs to block a task from a newer high priority job, and there is no way to force the execution of the high priority task without killing the low priority jobs.

Activity

md87 Chris Smith added a comment -

I've started work on a way to do this. At the minute it allows reduce tasks to be manually paused and resumed from the command line. The paused/unpaused state of a task is included as a second boolean in the return value of the statusUpdate/ping methods of the umbilical protocol, which isn't particularly elegant but works well enough. Paused tasks sit in a sleep loop waiting to be unpaused (this obviously means that they're consuming resources while not actually doing anything useful; the ideal solution would be to store them somewhere and resume later, as described in HADOOP-91).
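
As a rough illustration of the mechanism described above (a sketch only, not the actual patch; the class and method names are hypothetical), the task-side pause loop might look like this:

{code:java}
// Hypothetical sketch of the task-side pause loop: a flag fed by the
// boolean carried in the ping/statusUpdate response, and a loop the
// task's main thread sits in while paused.
public class PauseLoop {
  // Updated from each ping/statusUpdate response.
  private volatile boolean paused;

  public void setPaused(boolean paused) {
    this.paused = paused;
  }

  /** Blocks the calling (task) thread until the tracker unpauses it. */
  public void awaitUnpause() throws InterruptedException {
    while (paused) {
      Thread.sleep(1000); // sleep-poll once a second, still holding the task slot
    }
  }
}
{code}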

Suggestions for improvements/comments welcome...

amar_kamat Amar Kamat added a comment - edited

Chris,
We've been thinking about this for some time. This problem will become more visible once the new scheduler comes in, since that will have the pre-emption feature.

From our offline discussions, it makes more sense to suspend/resume reduce tasks, since on average the reducers run for a longer time and mostly determine the job runtime. It's also easier to suspend reducers, as one can always save the shuffled data and restart the REDUCE phase. Saving shuffle data might be a huge gain, but again there are issues with resources getting wasted and clean-up.

With maps it's difficult, since the maps mostly run faster and have just one phase, i.e. the MAP phase. When a map task runs, the following things determine its state:
1) The offset in the input that has been read
2) The mapped <k,v> pairs in memory
3) The data spilled to disk
4) External connections
One could probably optimise by reusing what has already been spilled and moving to the saved offset on restart/resume, but it's not clear how much gain this would give, or whether there are any use-cases that strongly demand it.
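
Purely for illustration (nothing like this exists yet; the class and field names are hypothetical), a suspend-to-disk snapshot of a map task would need to capture roughly the four components above:

{code:java}
// Hypothetical sketch of a serializable map-task snapshot covering the
// four state components listed above. Component 4 (external connections)
// cannot be serialized and would have to be re-established on resume.
import java.io.Serializable;
import java.util.List;

public class MapTaskSnapshot implements Serializable {
  long inputOffset;          // 1) offset into the input split already consumed
  byte[] inMemoryOutput;     // 2) mapped <k,v> pairs still buffered in memory
  List<String> spillFiles;   // 3) local paths of data already spilled to disk
  // 4) external connections: not serializable; reopen on resume
}
{code}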

Holding tasks in memory (i.e. the pause) might not be scalable. Hence we should think about suspend-to-fs/resume. As Vivek rightly pointed out (offline), it's not guaranteed that the job/org will get the same set of nodes back, and hence saving this state on local disk might not make sense. Saving to DFS would be a huge hit. Thoughts? Comments?

owen.omalley Owen O'Malley added a comment -

I don't think keeping the tasks in memory is feasible. They would at the very least need to be written to disk, and even that would be hard given the number of threads that we use in the framework to increase parallelism. In the current framework, you might make a command that lowers the priority of a job and kills any task that is still running N minutes later. That would be easy to do and have the right effect, wouldn't it?
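
For what it's worth, a rough sketch of such a command against the old mapred client API might look as follows. Treat it as pseudocode: the exact client methods available (getRunningTaskAttempts() in particular) vary by Hadoop version.

{code:java}
// Hedged sketch of the suggestion above: demote the job, wait N minutes,
// then kill any of its task attempts that are still running. The killed
// attempts lose their work and get rescheduled at the lower priority.
import org.apache.hadoop.mapred.*;

public class DemoteThenReap {
  public static void main(String[] args) throws Exception {
    JobID jobId = JobID.forName(args[0]);
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(jobId);

    job.setJobPriority("LOW");     // lower the job's priority
    Thread.sleep(10L * 60 * 1000); // N = 10 minutes

    if (!job.isComplete()) {
      for (TaskReport report : client.getReduceTaskReports(jobId)) {
        if (report.getCurrentStatus() == TIPStatus.RUNNING) {
          for (TaskAttemptID attempt : report.getRunningTaskAttempts()) {
            job.killTask(attempt, false); // kill (not fail) the attempt
          }
        }
      }
    }
  }
}
{code}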

md87 Chris Smith added a comment -

I agree that suspending/resuming reducers makes more sense than map tasks. I've only implemented the pause logic in reducers at the minute, and (assuming that it works well enough) I doubt there will be any need to add it to mappers.

As for keeping tasks in memory, it's obviously nowhere near ideal, but I view this more as an interim solution until HADOOP-91 is resolved. I'm thinking that with a bit of end-user diligence (ensuring that all higher-priority jobs are queued before the lower-priority one is paused) we could probably get away with having at most one paused task per node. Whether or not that will be problematic with regard to memory usage obviously depends on the tasks/hardware/config, and is something I'll have to look at. If keeping them in memory proves not to be feasible then I'll have to concentrate on suspending to the [d]fs.

Owen: unfortunately a command that results in tasks being killed wouldn't really have the effect I'm after; we don't want to waste the work that's been done by the (possibly long-running) lower-priority tasks. At present the tasks are killed manually to make way for the high-priority job; the idea of pausing is to preserve the progress that's already been made on those tasks until the high-priority job is out of the way.

amar_kamat Amar Kamat added a comment -

{quote}In the current framework, you might make a command that lowers the priority of a job and kills any task that is still running N minutes later. That would be easy to do and have the right effect, wouldn't it?{quote}

Wouldn't HADOOP-3444 introduce this (pre-emption)? Unless we decide to implement pre-emption earlier and modify it to cater to the scheduling needs.

md87 Chris Smith added a comment -

Attached a patch of my current progress on this issue. It defines a new job priority (PAUSED), which prevents new reducers from being started, and pauses existing reducers. You can also pause individual (reduce) tasks via the command line or web UI. Paused tasks (from non-paused jobs) are resumed when their tracker requests new work and there are no higher-priority tasks waiting.

The communication between the TaskTracker and in-progress tasks works by replacing the boolean response to ping/statusUpdate in the TaskUmbilicalProtocol with a TaskPingResponse object, which specifies both whether the task is known by the tracker and whether it is paused. Once a task is paused, it sits in a sleep loop waiting to be unpaused.
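
Based on that description, TaskPingResponse is presumably something like the following Writable (a reconstruction, not the patch itself; field and accessor names are guesses):

{code:java}
// Reconstruction of a TaskPingResponse carrying the two flags described
// above; the actual patch may differ.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TaskPingResponse implements Writable {
  private boolean known;  // does the TaskTracker know this task?
  private boolean paused; // should the task sit in its pause loop?

  public TaskPingResponse() {} // no-arg constructor required for Writable

  public TaskPingResponse(boolean known, boolean paused) {
    this.known = known;
    this.paused = paused;
  }

  public boolean isKnown()  { return known; }
  public boolean isPaused() { return paused; }

  public void write(DataOutput out) throws IOException {
    out.writeBoolean(known);
    out.writeBoolean(paused);
  }

  public void readFields(DataInput in) throws IOException {
    known = in.readBoolean();
    paused = in.readBoolean();
  }
}
{code}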

As previously mentioned, paused tasks are kept in memory, so there's an obvious limit on how much you can pause. We're currently testing the patch on a cluster to see whether or not this is problematic in practice.

Comments/suggestions welcome!

ab Andrzej Bialecki added a comment -

I think it would be good to add to the Task interface something like onPause()/onUnpause() methods. This way, map/reduce tasks about to be paused could prepare for the event (e.g. close DB connections or close the side-effect files), and similarly restore their state on un-pause.
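
Something like the following pair of hooks, presumably (a sketch of the suggestion, not an existing Hadoop interface):

{code:java}
// Sketch of the suggested lifecycle hooks; hypothetical, not part of
// the Hadoop API.
public interface Pausable {
  /** Called before the task is paused, e.g. to close DB connections or
   *  flush side-effect files. */
  void onPause();

  /** Called after the task is resumed, e.g. to reopen connections and
   *  restore state. */
  void onUnpause();
}
{code}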

vivekr Vivek Ratan added a comment -

I'm wondering if you really need this feature, now that we have pluggable schedulers (HADOOP-3412, HADOOP-3444, HADOOP-3746). Schedulers decide what the best task to run is, given priorities, resources, and other constraints. If you're concerned about low-priority jobs blocking resources for higher-priority jobs that are submitted later, you may want to look at one of these schedulers, or write your own that deals with the situation. Granted, you still won't be able to pause tasks, but you may not need to if your scheduler works correctly. It may start running a task from a low-priority job, and then assign a task from the higher-priority job to the next available TT. Yes, you can reach a situation where tasks from the lower-priority job are consuming all resources and are all long-running, so the higher-priority job is left waiting, but there are ways to counter that. HADOOP-3444 provides capacities and user limits, along with preemption, to prevent one user or one job from taking over the system.

I think supporting task pausing is non-trivial, so you want to make sure there is a real need for it.

matei@eecs.berkeley.edu Matei Zaharia added a comment -

The really hard challenge with pausing, in my opinion, will be how to decide when to resume the tasks or when to kill them. It's not clear that if you pause a task on some machine, you'll get the opportunity to run it again. In fact, maybe another machine becomes free and you'd be better off running the task on that one. So the whole scheduling problem becomes more difficult.

Another fix that we really have to strive for is making reduces smaller, e.g. by separating the copy phase into its own set of tasks (Joydeep has posted some comments on this in the MapReduce 2.0 discussion).


People

    • Assignee: md87 Chris Smith
    • Reporter: md87 Chris Smith
    • Votes: 1
    • Watchers: 11
