Hadoop Map/Reduce
  MAPREDUCE-2198

Allow FairScheduler to control the number of slots on each TaskTracker

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: contrib/fair-share
    • Labels: None

      Description

      We can set the number of slots on the TaskTracker to be high and let the FairScheduler manage the slots.
      This approach allows us to change the number of slots on each node dynamically.
      The administrator can change the number of slots with a CLI tool.

      One use case of this is upgrading MapReduce.
      Instead of restarting the cluster, we can run the new MapReduce on the same cluster
      and use the CLI tool to gradually migrate the slots.
      This way we don't lose the progress of the jobs that are already running.
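The migration flow described above can be sketched as a small slot-accounting model. This is purely illustrative (all class and method names here are hypothetical; the actual patch implements slot control inside the FairScheduler): two JT instances share a node's physical slots, and an admin moves slots from one to the other without a restart.

```java
// Hypothetical sketch of the slot-migration idea: the old and new JT
// instances share a fixed number of physical slots per node, and an admin
// moves slots from one to the other without restarting anything.
import java.util.HashMap;
import java.util.Map;

public class SlotMigrationSketch {
    // slots currently granted to the old and new JT instances, per tracker
    private final Map<String, Integer> oldJtSlots = new HashMap<>();
    private final Map<String, Integer> newJtSlots = new HashMap<>();

    public SlotMigrationSketch(String tracker, int totalSlots) {
        oldJtSlots.put(tracker, totalSlots);  // all slots start on the old JT
        newJtSlots.put(tracker, 0);
    }

    // Move n slots on the given tracker from the old JT to the new JT.
    public void migrate(String tracker, int n) {
        int avail = oldJtSlots.get(tracker);
        if (n > avail) {
            throw new IllegalArgumentException("only " + avail + " slots left");
        }
        oldJtSlots.put(tracker, avail - n);
        newJtSlots.put(tracker, newJtSlots.get(tracker) + n);
    }

    public int oldSlots(String t) { return oldJtSlots.get(t); }
    public int newSlots(String t) { return newJtSlots.get(t); }

    public static void main(String[] args) {
        SlotMigrationSketch s = new SlotMigrationSketch("tt-1", 8);
        s.migrate("tt-1", 3);  // one gradual migration step
        System.out.println(s.oldSlots("tt-1") + " " + s.newSlots("tt-1"));  // 5 3
    }
}
```

Repeating the migrate step eventually drains the old JT's slots to zero, at which point it can be decommissioned with no job loss.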

      1. MAPREDUCE-2198-v2.txt
        31 kB
        Scott Chen
      2. MAPREDUCE-2198.txt
        28 kB
        Scott Chen

        Issue Links

          Activity

          Scott Chen created issue -
          Scott Chen made changes -
          Field Original Value New Value
          Link This issue relates to MAPREDUCE-2108 [ MAPREDUCE-2108 ]
          M. C. Srivas added a comment -

          The scheduler can easily choose to ignore the slot information while scheduling, correct? Then why is this needed? Are we running two TT's on each node, that talk to different JT's, and "migrating" slots from one to the other?

          dhruba borthakur added a comment -

          > Are we running two TT's on each node, that talk to different JT's, and "migrating" slots from one to the other?

          Precisely. We have a solution that is used to deploy new software to the JT/TT. Earlier, the JT/TT had to be shut down, new code deployed, and then the cluster restarted. This meant that the cluster was unavailable for a while, and all currently running jobs failed when the cluster was shut down.

          The modified approach is to direct all new jobs to the newly created JT instance, and then slowly (and proportionally) migrate slots from the old JT instance to the new JT instance. This allows JT software upgrades without incurring any cluster downtime.

          Scott Chen added a comment -

          Hey M.C.

          Yes, we can set a higher slot limit on TT and let scheduler manage the slots.

          Are we running two TT's on each node, that talk to different JT's, and "migrating" slots from one to the other?

          Yes. The motivation here is that when deploying a new JT and TT, we need to restart the cluster and we lose all the running jobs.
          This can be solved by the way you described.

          Another use case is that people can experiment to find the best slot settings using the CLI without restarting the cluster.
          Right now, if you want to change the number of slots, you have to change the conf on every TT and restart.

          Scott

          Joydeep Sen Sarma added a comment -

          what is fairscheduler specific about this? could we not make this a change in the JT directly to scale the slots advertised by TT?

          in addition - one bug/feature that we need to fix as part of this where the JT overschedules TTs (when configured to schedule multiple tasks per hbt). this is benign today (TT puts those tasks in unassigned state) - but in this world will not be so benign.

          Arun C Murthy added a comment -

          Right now if you want to change the number of slots, you have to change the conf on every TT and restart.

          How do you handle heterogeneous clusters? Or will your CLI command be per TT?

          Also, as Joydeep pointed out there are several issues with unassigned slots in TTs...

          Arun C Murthy added a comment -

          Also, as Joydeep pointed out there are several issues with unassigned slots in TTs...

          You also have to worry about piggy-backing of task-cleanup tasks done by the JT...

          Scott Chen added a comment -

          Hey Arun and Joydeep,

          How do you handle heterogeneous clusters? Or will your CLI command be per TT?

          Yes, the CLI command will be per TT.

          in addition - one bug/feature that we need to fix as part of this where the JT overschedules TTs (when configured to schedule multiple tasks per hbt). this is benign today (TT puts those tasks in unassigned state) - but in this world will not be so benign.

          Yes, we should never let the JT submit more tasks than the TT's limit. The scheduler should incorporate this logic.
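
The over-scheduling guard Scott describes amounts to clamping each heartbeat's assignments to the scheduler-managed limit rather than the TT's advertised capacity. A minimal sketch of that check (names are hypothetical; the real logic would live inside the FairScheduler's task-assignment path):

```java
// Illustrative sketch, not the patch itself: before handing new tasks to a
// TT in a heartbeat response, clamp the assignment count to the dynamically
// configured slot limit so the JT can never over-schedule the tracker.
public class SlotClamp {
    // How many new tasks may be assigned given the scheduler-managed limit.
    public static int assignable(int runningTasks, int schedulerSlotLimit) {
        return Math.max(0, schedulerSlotLimit - runningTasks);
    }

    public static void main(String[] args) {
        System.out.println(assignable(2, 6)); // 4 slots still free
        System.out.println(assignable(6, 4)); // 0: limit lowered below current usage
    }
}
```

Note the second case: when an admin lowers the limit below the running-task count, the correct behavior is to assign nothing and let the tracker drain, not to return a negative count.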

          You also have to worry about piggy-backing of task-cleanup tasks done by the JT...

          I am not very clear about this part. I will read more of the code.

          Scott Chen added a comment -

          Hey Arun,

          You also have to worry about piggy-backing of task-cleanup tasks done by the JT...

          I read some of the code. Now I see your point.
          JobTracker.getSetupAndCleanupTasks() is not controlled by the Scheduler.
          So we need to put some logic there to make it aware of this task limit, or we might over-schedule.

          Arun C Murthy added a comment -

          Right. Also, you need to worry about task-cleanup-tasks i.e. responses to move tasks from COMMIT_PENDING to SUCCESS/KILLED.

          Matei Zaharia added a comment -

          Do setup and cleanup tasks do a significant amount of work? I can see two simple ways of dealing with them other than moving control to the scheduler: Either always allow them to run (even if the TT is full on map and reduce slots), or allow them to run but limit the number of such tasks per node to 1 (i.e. have a "setup slot"). If they do need to do CPU-intensive work, then it makes sense to give control of them to the scheduler.

          Matei Zaharia added a comment -

          One other minor comment: if the fair scheduler is over-scheduling task trackers, maybe we should consider that a bug. I don't think it was intended to do that, although in trunk at least, it looks like it may do it if mapAssignCap and reduceAssignCap are set to something less than infinity. (Otherwise, it looks at the number of slots free on the TT and does not assign more than that.) To deal with any sort of race condition that occurs if you lower a slot count while a heartbeat is in progress, I'd suggest making the TT report over-scheduled tasks as killed and drop them.

          Scott Chen made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Scott Chen made changes -
          Attachment MAPREDUCE-2198.txt [ 12460482 ]
          Scott Chen made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Scott Chen made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Scott Chen made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Scott Chen made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Affects Version/s 0.23.0 [ 12315570 ]
          Affects Version/s 0.22.0 [ 12314184 ]
          Fix Version/s 0.23.0 [ 12315570 ]
          Fix Version/s 0.22.0 [ 12314184 ]
          Scott Chen made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Scott Chen made changes -
          Attachment MAPREDUCE-2198.txt [ 12460484 ]
          Scott Chen made changes -
          Attachment MAPREDUCE-2198.txt [ 12460482 ]
          Scott Chen added a comment -

          The patch is ready for review.
          Matei: Would you mind helping me review this one?

          Scott Chen made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Scott Chen added a comment -

          I tried using Review Board, but I always get a 500 error.

          Matei Zaharia added a comment -

          The approach in the patch looks good, but I have two questions:

          • How can you run the FairSchedulerShell from the command line? It doesn't seem to have a main method (so just using bin/hadoop org.apache.hadoop.mapred.FairSchedulerShell doesn't work), and I don't see it registered as a tool anywhere.
          • Will the slot counts set by the scheduler be visible in the JobTracker web UI? It looks like jobtracker.jsp looks at ClusterMetrics and machines.jsp looks at TaskTrackerStatus objects.
          Scott Chen added a comment -

          Hey Matei, Thanks for the review.

          How can you run the FairSchedulerShell from the command line? It doesn't seem to have a main method (so just using bin/hadoop org.apache.hadoop.mapred.FairSchedulerShell doesn't work), and I don't see it registered as a tool anywhere.

          It's my bad. I forgot to put a main method in FairSchedulerShell. I will update it.

          Will the slot counts set by the scheduler be visible in the JobTracker web UI? It looks like jobtracker.jsp looks at ClusterMetrics and machines.jsp looks at TaskTrackerStatus objects.

          For the ClusterMetrics, it's OK because we use TaskScheduler.getMaxSlots() to calculate the total slots in the JobTracker. But you are right about machines.jsp. I should change it so that it also pulls the information from the scheduler.
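
The missing entry point Matei flagged would, in Hadoop, normally be an implementation of org.apache.hadoop.util.Tool dispatched through ToolRunner.run(). Since the patch itself isn't shown here, the following is only a standalone sketch of the shape such a shell might take; the "-setSlots" subcommand and its arguments are hypothetical:

```java
// Minimal standalone sketch of a shell entry point. A real Hadoop version
// would implement Tool and go through ToolRunner; this version only parses
// a hypothetical "-setSlots <tracker> <maps> <reduces>" command.
public class FairSchedulerShellSketch {
    // Returns 0 on success, -1 on usage error (mirrors Hadoop shell tools).
    public static int run(String[] args) {
        if (args.length == 4 && args[0].equals("-setSlots")) {
            String tracker = args[1];
            int maps = Integer.parseInt(args[2]);
            int reduces = Integer.parseInt(args[3]);
            System.out.println("set " + tracker + " to " + maps + " map / "
                + reduces + " reduce slots");
            return 0;
        }
        System.err.println(
            "Usage: FairSchedulerShell -setSlots <tracker> <maps> <reduces>");
        return -1;
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

With a main method like this registered, the command would be runnable as bin/hadoop org.apache.hadoop.mapred.FairSchedulerShell, which is what Matei tried.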

          Scott Chen added a comment -

          Addressed Matei's comments.

          Scott Chen made changes -
          Attachment MAPREDUCE-2198-v2.txt [ 12464776 ]
          Matei Zaharia added a comment -

          The changes look good, but I thought about one other issue: What should we do when we are asked to lower the slots on a node to below the number of running tasks on it? In the current version, the scheduler won't launch tasks on that node until its running task count falls below its slot count. However, if we wanted to use this for rollover, we'd probably want to wait until enough of those tasks are done before giving a slot to the new JobTracker. There are two ways we can do this: Either have the process that's scaling down the cluster watch the running tasks before giving the slots to someone else, or include an API that somehow makes a callback when the number of running tasks has decreased below the target slot count. What are your thoughts on this?

          One other thing we may want to support is killing tasks after a timeout if the cluster hasn't scaled down. However, I think this can already be done through the MRAdmin shell command / API.

          In either case, we probably need some API to see what's running on the cluster. Some of the commands in MRAdmin might be enough, but we may want to add something there. However, this can be a different JIRA.
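
Matei's first option — having the scale-down process watch running tasks before handing slots to the new JT — can be sketched as a simple drain-then-release loop. Everything here is hypothetical: the IntSupplier stands in for the "running tasks on this tracker" query that the discussion suggests exposing via an API.

```java
import java.util.function.IntSupplier;

// Sketch of the "watch before handing over" approach: after lowering a
// tracker's slot count, poll its running-task count and only release the
// freed slots to the new JT once usage has drained below the target.
public class SlotDrainWatcher {
    // Poll until runningTasks <= targetSlots, up to maxPolls attempts.
    // Returns true if the tracker drained in time.
    public static boolean waitForDrain(IntSupplier runningTasks,
                                       int targetSlots, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            if (runningTasks.getAsInt() <= targetSlots) {
                return true;  // safe to give the slots to the new JT
            }
            // a real implementation would sleep between polls here
        }
        return false;  // caller may escalate, e.g. kill tasks via MRAdmin
    }

    public static void main(String[] args) {
        int[] counts = {6, 5, 4, 3};  // simulated task count draining over time
        int[] idx = {0};
        boolean ok = waitForDrain(() -> counts[Math.min(idx[0]++, 3)], 3, 10);
        System.out.println(ok);  // true once the count reaches the target of 3
    }
}
```

The false branch is where the timeout-and-kill fallback Matei mentions would plug in: if the tracker never drains within the polling budget, the external process can fall back to killing tasks through the existing MRAdmin interface.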

          dhruba borthakur added a comment -

          > Either have the process that's scaling down the cluster watch the running tasks before giving the slots to someone else

          I like this idea (instead of having callbacks, keeps the design simple)

          Scott Chen added a comment -

          However, I think this can already be done through the MRAdmin shell command / API.

          I also prefer using the existing interface and keeping this one simpler.

          Joydeep Sen Sarma added a comment -

          +1 on Matei's previous comments about needing to wait until slots are actually released. We should have another api to request the actual number of [map/reduce] slots in use on any given tracker and only claim slots when they are actually confirmed to be released.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12464776/MAPREDUCE-2198-v2.txt
          against trunk revision 1074251.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/51//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/51//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/51//console

          This message is automatically generated.

          Arun C Murthy added a comment -

          Sorry to come in late, the patch has gone stale. Can you please rebase? Thanks.

          Given this is not an issue with MRv2 should we still commit this? I'm happy to, but not sure it's useful. Thanks.

          Arun C Murthy made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Scott Chen added a comment -

          Arun: Thanks for the comments. You are right. I guess this is not an issue since we have MRv2. Closing this now.

          Scott Chen made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Won't Fix [ 2 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Scott Chen
              Reporter:
              Scott Chen
            • Votes:
              0
              Watchers:
              11
