Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Introduced Fair Scheduler.

      Description

      The default job scheduler in Hadoop has a first-in-first-out queue of jobs for each priority level. The scheduler always assigns task slots to the first job in the highest-level priority queue that is in need of tasks. This makes it difficult to share a MapReduce cluster between users because a large job will starve subsequent jobs in its queue, but at the same time, giving lower priorities to large jobs would cause them to be starved by a stream of higher-priority jobs. Today one solution to this problem is to create separate MapReduce clusters for different user groups with Hadoop On-Demand, but this hurts system utilization because a group's cluster may be mostly idle for long periods of time. HADOOP-3445 also addresses this problem by sharing a cluster between different queues, but still provides only FIFO scheduling within a queue.

      This JIRA proposes a job scheduler based on fair sharing. Fair sharing splits up compute time proportionally between jobs that have been submitted, emulating an "ideal" scheduler that gives each job 1/Nth of the available capacity. When there is a single job running, that job receives all the capacity. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that everyone gets roughly the same amount of compute time. This lets short jobs finish in reasonable amounts of time while not starving long jobs. This is the type of scheduling used or emulated by operating systems - e.g. the Completely Fair Scheduler in Linux. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that a job should get.
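To make the weighted-sharing arithmetic concrete, here is a small Java sketch (illustrative only - the class and method names are not from the actual patch) of computing each job's ideal share as its weight divided by the total weight, times the cluster's slot count:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: each running job's "ideal" share of the cluster
// is its weight divided by the total weight, times the number of slots.
public class FairShares {
  public static Map<String, Double> idealShares(
      Map<String, Double> jobWeights, int totalSlots) {
    double totalWeight = 0;
    for (double w : jobWeights.values()) {
      totalWeight += w;
    }
    Map<String, Double> shares = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : jobWeights.entrySet()) {
      shares.put(e.getKey(), totalSlots * e.getValue() / totalWeight);
    }
    return shares;
  }
}
```

With equal weights this degenerates to the 1/Nth split described above; unequal weights (from priorities) skew the split proportionally.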

      In addition, the scheduler will support a way to guarantee capacity for particular jobs or user groups. A job can be marked as belonging to a "pool" using a parameter in the jobconf. An "allocations" file on the JobTracker can assign a minimum allocation to each pool, which is a minimum number of map slots and reduce slots that the pool must be guaranteed to get when it contains jobs. The scheduler will ensure that each pool gets at least its minimum allocation when it contains jobs, but it will use fair sharing to assign any excess capacity, as well as the capacity within each pool. This lets an organization divide a cluster between groups similarly to the job queues in HADOOP-3445.
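As a rough illustration of the pool guarantee (the Pool class and method here are hypothetical, not taken from the patch), the scheduler's first question when a slot frees up is which pools contain jobs but sit below their minimum allocation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: pools below their guaranteed minimum slot count
// are served first; excess capacity is then divided by fair sharing.
public class PoolAllocations {
  public static class Pool {
    final String name;
    final int minMapSlots;    // guarantee from the allocations file
    int runningMapTasks;      // slots the pool currently holds
    public Pool(String name, int minMapSlots, int runningMapTasks) {
      this.name = name;
      this.minMapSlots = minMapSlots;
      this.runningMapTasks = runningMapTasks;
    }
  }

  // Pools that hold fewer slots than their guarantee.
  public static List<Pool> poolsBelowMinimum(List<Pool> pools) {
    List<Pool> needy = new ArrayList<>();
    for (Pool p : pools) {
      if (p.runningMapTasks < p.minMapSlots) {
        needy.add(p);
      }
    }
    return needy;
  }
}
```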

      Implementation Status:

      I've implemented this scheduler using a version of the pluggable scheduler API in HADOOP-3412 that works with Hadoop 0.17. The scheduler supports fair sharing, pools, priorities for weighing job shares, and a text-based allocation config file that is reloaded at runtime whenever it has changed to make it possible to change allocations without restarting the cluster. I will also create a patch for trunk that works with the latest interface in the patch submitted for HADOOP-3412.

      The actual implementation is simple. To implement fair sharing, the scheduler keeps track of a "deficit" for each job - the difference between the amount of compute time it should have gotten on an ideal scheduler, and the amount of compute time it actually got. This is a measure of how "unfair" we've been to the job. Every few hundred milliseconds, the scheduler updates the deficit of each job by looking at how many tasks each job had running during this interval vs. how many it should have had given its weight and the set of jobs that were running in this period. Whenever a task slot becomes available, it is assigned to the job with the highest deficit - unless there were one or more jobs that were not meeting their pool capacity guarantees, in which case we choose among those "needy" jobs based again on their deficit.
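The deficit bookkeeping described above can be sketched as follows (a minimal illustration with a hypothetical Job class, not the actual Hadoop implementation): each interval, a job's deficit grows by its fair share and shrinks by what it actually ran, and a freed slot goes to the job with the highest deficit.

```java
import java.util.List;

// Illustrative sketch of deficit-based scheduling.
public class DeficitScheduler {
  public static class Job {
    final String name;
    double fairShare;   // slots this job deserves under ideal fair sharing
    int runningTasks;   // slots it actually held during the interval
    double deficit;     // accumulated (deserved - received) compute time
    public Job(String name, double fairShare, int runningTasks) {
      this.name = name;
      this.fairShare = fairShare;
      this.runningTasks = runningTasks;
    }
  }

  // Called every few hundred milliseconds in the real scheduler.
  public static void updateDeficits(List<Job> jobs, long intervalMs) {
    for (Job j : jobs) {
      j.deficit += (j.fairShare - j.runningTasks) * intervalMs;
    }
  }

  // When a slot frees up, pick the job we've been most unfair to.
  public static Job assignSlot(List<Job> jobs) {
    Job best = null;
    for (Job j : jobs) {
      if (best == null || j.deficit > best.deficit) {
        best = j;
      }
    }
    return best;
  }
}
```

The "needy pool" check described in the text would simply restrict the candidate list passed to assignSlot to jobs in pools below their guarantee.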

      Extensions:

      Once we keep track of pools, weights and deficits, we can do a lot of interesting things with a fair scheduler. One feature I will probably add is an option to give brand new jobs a priority boost until they have run for, say, 10 minutes, to reduce response times even further for short jobs such as ad-hoc queries, while still being fair to longer-running jobs. It would also be easy to add a "maximum number of tasks" cap for each job as in HADOOP-2573 (although with priorities and pools, this JIRA reduces the need for such a cap - you can put a job in its own pool to give it a minimum share, and set its priority to VERY_LOW so it never takes excess capacity if there are other jobs in the cluster). Finally, I may implement "hierarchical pools" - the ability for a group to create pools within its pool, so that it can guarantee minimum allocations to various types of jobs but ensure that together, its jobs get capacity equal to at least its full pool.

      1. hadoop-3746-0.18.3.diff
        204 kB
        Todd Lipcon
      2. hadoop-3746-0.18.3-delta.diff
        1 kB
        Todd Lipcon
      3. fairscheduler-0.18.3.patch
        202 kB
        Matei Zaharia
      4. fairscheduler-0.18.1.patch
        202 kB
        Matei Zaharia
      5. fairscheduler-0.17.2.patch
        153 kB
        Matei Zaharia
      6. fairscheduler-v6.patch
        136 kB
        Matei Zaharia
      7. fairscheduler-v5.2.patch
        118 kB
        Matei Zaharia
      8. fairscheduler-v5.1.patch
        103 kB
        Matei Zaharia
      9. fairscheduler-v5.patch
        102 kB
        Matei Zaharia
      10. fairscheduler-v4.patch
        53 kB
        Matei Zaharia
      11. fairscheduler-v3.1.patch
        38 kB
        Matei Zaharia
      12. fairscheduler-v3.patch
        37 kB
        Matei Zaharia
      13. fairscheduler-v2.patch
        36 kB
        Matei Zaharia
      14. fairscheduler-v1.patch
        36 kB
        Matei Zaharia

        Issue Links

          Activity

          Todd Lipcon added a comment -

          Attached are a delta since the previous patch as well as a new patch that incorporates both. I ran some MiniMR test cases and they passed, though I have not run the full suite.

          Todd Lipcon added a comment -

          The 0.18.3 patch attached to this ticket introduces a race condition (documented in HADOOP-5852) where the interTrackerServer is started before the taskTrackerManager member is set. I will attach a delta against that patch here.

          Matei Zaharia added a comment -

          (I had previously created this 0.18.3 patch for Cloudera by the way, so don't assume that I hacked it together in 10 minutes and it's untested! It required a small code change since the JobTracker code had changed enough to prevent the old patch from applying.)

          Matei Zaharia added a comment -

          The 0.18.1 patch should also work with 0.18.2, but doesn't work with 0.18.3. I've attached a patch for 0.18.3.

          Bill Au added a comment -

          Will the patch for 0.18.1 work for 0.18.2 or 0.18.3 as well?

          Matei Zaharia added a comment -

          For those interested, here is a patch for 0.18.1, which includes HADOOP-3412 and the fair scheduler.

          Matei Zaharia added a comment -

          Hi Eric,

          Are you trying to use the patch attached to this JIRA? This patch is for 0.19, which introduced API support for plugging in different schedulers (see HADOOP-3412), so it won't apply easily in 0.18.1. If you can't upgrade to 0.19 at this time, I also have a patch for 0.17.2 which includes minimal changes to the JobTracker to support the pluggable scheduler API. I'll attach this to the JIRA too. I haven't tested that particular patch with 0.18, but I think it should apply more cleanly. (Will let you know when I have some time to test it.)

          The Frizz added a comment -

          I can't get the patch to apply to 0.18.1; has anyone tried this? It seems like an ant build config problem. Any help would be greatly appreciated.

          Hudson added a comment -

          Integrated in Hadoop-trunk #589 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/589/)
          Hemanth Yamijala added a comment -

          > Thanks Owen! When HADOOP-3930 is finalized, I'll post a patch to support it in this scheduler.

          Matei, we are working on a version of HADOOP-3930 (based on the approach being discussed there). It will include the minimal changes needed for the FairScheduler to still compile and work with the new UI, but it likely won't cover everything, since you know the scheduler best. Please watch for that patch so you can review it, and then you can open a new JIRA to enhance the UI per the changes in HADOOP-3930.

          Tsz Wo Nicholas Sze added a comment -

          TestFairScheduler failed on Linux. See HADOOP-4050.

          Matei Zaharia added a comment -

          Thanks Owen! When HADOOP-3930 is finalized, I'll post a patch to support it in this scheduler.

          Owen O'Malley added a comment -

          I just committed this. Thanks, Matei!

          dhruba borthakur added a comment -

          I think this patch should be committed. We have been running this internally for the last few weeks without any problems. The unit test failures are not caused by this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388201/fairscheduler-v6.patch
          against trunk revision 687868.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3079/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3079/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3079/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3079/console

          This message is automatically generated.

          Matei Zaharia added a comment -

          Resubmitting patch because auto-tester seemed to be broken.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12388201/fairscheduler-v6.patch
          against trunk revision 686362.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3062/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3062/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3062/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3062/console

          This message is automatically generated.

          Tom White added a comment -

          +1

          This looks good to me. The README is excellent.

          > Although values() / allOf() probably give the values in the order they're defined in the enum, I wanted to make it explicit.

          Fair enough - there's not much chance of a new value being added here, so it's fine. (Both the values() method and the EnumSet#allOf() method guarantee to return the values in the order they are defined in the enum. See http://java.sun.com/docs/books/jls/third_edition/html/classes.html#302265 and http://java.sun.com/javase/6/docs/api/java/util/EnumSet.html)

          Matei Zaharia added a comment -

          Here is hopefully the last patch in terms of features. This one adds support for limiting the number of running jobs from each queue and from each user, similar to the behavior specified in HADOOP-3421. This is useful for clusters where many jobs are submitted at once by the same entity and we wish to limit how many can run at the same time to improve performance (by having each job finish faster, use less temporary storage, etc.). This patch also changes the config file format to XML so that more types of configuration elements can be included. There are three new unit tests for this functionality and an example config file in the ReadMe.
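          For a sense of what the XML format enables, an allocations file might look like the following (a hypothetical example - see the ReadMe in the patch for the authoritative element names and format):

          ```xml
          <?xml version="1.0"?>
          <allocations>
            <pool name="analytics">
              <minMaps>10</minMaps>
              <minReduces>5</minReduces>
              <maxRunningJobs>3</maxRunningJobs>
            </pool>
            <user name="alice">
              <maxRunningJobs>2</maxRunningJobs>
            </user>
          </allocations>
          ```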

          Matei Zaharia added a comment -

          Thanks for your comments, Tom! Here's a new patch, including fixed license headers, a build file, no scheduler singleton and a detailed ReadMe.

          Regarding the enum values - I've actually used for (TaskType type: TaskType.values()) to iterate through them in some places, but in assignTasks in particular, I wanted to make sure I look at maps first and then reduces, because I want to assign maps before reduces. Although values() / allOf() probably give the values in the order they're defined in the enum, I wanted to make it explicit.

          Tom White added a comment -

          A few comments:

          • Rather than having a static instance field on FairScheduler, could we not store the FairScheduler instance in the servlet application context? E.g. in FairScheduler#start:

              infoServer.setAttribute("scheduler", this);

            and then retrieve the instance in the servlet's init method.

          • The TaskType enum is useful (it might be good to use it in core). A small point, but the following is a shorter way of iterating over map and reduce types than creating an array of types (see FairScheduler#assignTasks):

              for (TaskType taskType : EnumSet.allOf(TaskType.class)) {
                ...
              }
          • FairScheduler#sortJobs isn't called anywhere as far as I can see. Can we get rid of it?
          • It would be useful to document the format of the allocation config file. Also, a README giving a brief description of the scheduler, and how to use it would be good.
          • We need a build file, something like this should do:
            <project name="fairscheduler" default="jar">
            
              <import file="../build-contrib.xml"/>
            
            </project>
            
          Tom White added a comment -

          > I'm worried about a hand full of schedulers all putting a bunch of classes into the mapred package.

          +1

          > I think we either need to put the schedulers into packages or put them into contrib.

          I would vote to commit this into contrib. If we wanted to move it into core at a later date we should put it in its own package, but that really needs HADOOP-3916 and HADOOP-3822 to be done first.

          Matei Zaharia added a comment -

          Thanks for the clarification, I'll put it in the next version.

          Hemanth Yamijala added a comment -

          In this case, though, it looks like TestFairScheduler doesn't have the header, so you'll need to add that. Details can be found here: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3005/artifact/trunk/current/releaseAuditDiffWarnings.txt

          Arun C Murthy added a comment -

          > Also, I don't understand the release audit warning; what does it mean?

          You might have some files with missing licenses...

          Hemanth Yamijala added a comment -

          Matei, the release audit warning usually means one of the files in the patch doesn't have the ASF license header. Typically this is the case if you have new configuration files. If that's the case, it can be ignored. I just add a comment to the JIRA explaining the warning is due to this reason.

          Matei Zaharia added a comment -

          New patch with one additional feature - ability to set jobs' weights based on their size (so bigger jobs get bigger shares).

          Also, I don't understand the release audit warning; what does it mean?

          Matei Zaharia added a comment -

          Hi Owen,

          It's definitely possible to put the schedulers in different classes, but this would require making a public API for the package-private classes and fields in the mapred package (https://issues.apache.org/jira/browse/HADOOP-3822). If we don't want to do that right away, then keeping stuff in contrib will work, as long as people don't try to run with multiple schedulers with conflicting class names on their classpath.

          Owen O'Malley added a comment -

          This is looking good, but I'm worried about a hand full of schedulers all putting a bunch of classes into the mapred package. I think we either need to put the schedulers into packages or put them into contrib. Thoughts?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12387391/fairscheduler-v5.patch
          against trunk revision 682978.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 216 release audit warnings (more than the trunk's current 215 warnings).

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3005/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3005/artifact/trunk/current/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3005/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3005/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3005/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Cool. The UI looks great.

          1. Can the scheduler look for modification time of the config file to decide whether to read it in or not?
          2. A unit test to trigger the code that reads in the config file while it is being edited by the user
          3. Possibility of using FIFO scheduler and/or FairShare scheduling inside a pool.
          4. Is there any way to test the locking model of the scheduler?
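Dhruba's first point (reloading the config file based on its modification time) could be sketched along these lines. This is illustrative only; the class and field names below are made up and are not from the patch:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch: reload the allocation file only when its
// modification time has advanced since the last successful load.
public class ConfigReloader {
    private long lastLoadTime = 0;

    // Returns true when the file should be (re)read.
    public boolean needsReload(File configFile) {
        long mtime = configFile.lastModified(); // 0 if the file does not exist
        if (mtime > lastLoadTime) {
            lastLoadTime = mtime;
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("pools", ".conf");
        f.deleteOnExit();
        ConfigReloader r = new ConfigReloader();
        System.out.println(r.needsReload(f)); // first check sees the new file: true
        System.out.println(r.needsReload(f)); // unchanged since last load: false
    }
}
```

A periodic timer in the scheduler could call needsReload and skip the parse entirely when nothing has changed.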

          Matei Zaharia added a comment -

          New patch that includes unit tests, formatting according to Hadoop code standards, and documentation.

          Matei Zaharia added a comment -

          New patch that includes unit tests, formatting according to Hadoop code style, and more complete documentation.

          Vivek Ratan added a comment -

          I've been thinking of this as well, and have had a few discussions with folks. I've opened a separate Jira (HADOOP-3876). We should discuss options there. Sound good?

          Matei Zaharia added a comment -

          I also have another question for the folks watching this issue. So far I've organized my patch to have my code in contrib, but would it be better to have it in src/java/mapred instead? I think this would have two advantages. First, other schedulers, such as HADOOP-3445, would be able to use fair sharing within a queue if they wish to do so in the future (V4 of my patch has all of the scheduling code in one class, but V3 had a separate FairSharingJobSelector that could be called from other schedulers, and it would be easy to separate these two again if there is interest). Second, I think new organizations using Hadoop might want to be able to use fair sharing instead of the FIFO scheduler if they are just setting up a cluster and sharing it between a few users. It's an intuitive, easy-to-understand policy. Having the scheduler in Hadoop would let them set one config parameter to do this, without having to explicitly create user groups and assign jobs to queues at submit time as in HADOOP-3445. At the same time, the fair scheduler lets you set up and edit special queues when you need them (even at runtime, without restarting the JobTracker).

          Matei Zaharia added a comment -

          Here is a new patch incorporating a lot of improvements. The highlights are bug fixes to the fair sharing calculations, support for speculative execution (using the API in https://issues.apache.org/jira/browse/HADOOP-3840), clearer code organization, and a web interface for viewing scheduler state available at <job tracker URL>/scheduler. The web UI also provides support for switching to FIFO.

          Almost all the code is in contrib, but there is one change in mapred to make the activeTasks variable in TaskInProgress accessible to other classes in the same package - this is necessary for properly counting active tasks in each job.

          The code is more or less usable as is. A few things I want to add include unit tests, support on the web UI for changing jobs' priorities and pools, and a README file about the config options. Given that most of the changes are in contrib, what will be the process for getting this reviewed and committed once it is testable?

          steve_l added a comment -

          linking to HADOOP-3412, which is now committed, though possibly still unstable.

          Matei Zaharia added a comment -

          Some bug fixes.

          Matei Zaharia added a comment -

          Here is an updated patch which simplifies the logic by putting it into FairSharingJobSelector and also fixes a bug in the previous calculation of fair share (it didn't take into account minimums, so jobs would consider themselves "starved" if some job with a large allocation was running). Once HADOOP-3412 is committed, I will update this patch to work with the final API in there.
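The minimums interaction described above can be illustrated with a deliberately simplified calculation (this is a sketch, not the patch's actual algorithm): each pool keeps its guaranteed minimum, and only the leftover capacity is divided evenly, so one pool's large guaranteed allocation no longer makes the other jobs look starved.

```java
public class FairShareSketch {
    // Simplified fair-share calculation that respects minimum allocations:
    // guaranteed slots come off the top, and only the remainder is split
    // evenly across pools.
    static double[] fairShares(int totalSlots, int[] minimums) {
        int guaranteed = 0;
        for (int m : minimums) {
            guaranteed += m;
        }
        double extra = Math.max(0, totalSlots - guaranteed) / (double) minimums.length;
        double[] shares = new double[minimums.length];
        for (int i = 0; i < minimums.length; i++) {
            shares[i] = minimums[i] + extra;
        }
        return shares;
    }

    public static void main(String[] args) {
        // 10 slots, one pool guarantees 4: it gets 4 + 3, the other gets 0 + 3.
        double[] s = fairShares(10, new int[] {4, 0});
        System.out.println(s[0] + " " + s[1]); // 7.0 3.0
    }
}
```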

          Matei Zaharia added a comment -

          Vivek - You're right, JIRAs like this one and HADOOP-3445 are adding a lot of capabilities to the scheduling. Some of them might overlap, but I think it's useful to have several implementations to choose from - this is done by Linux, for example. The philosophy behind this patch was to create a scheduler that "just works" in the common use case of multiple users running jobs of various lengths, without requiring any configuration and administration work such as setting up pools, etc., while also providing enough capabilities to allow group allocations to be set up when necessary. This is the situation we want to support at Facebook (where I'm doing this work) - there is a Hadoop cluster which has regular users, but it is also becoming popular with external users who want to run a variety of jobs. Most jobs run as root, and it's tough to require everybody to specify a project name or task cap, so we want the default behavior to be sensible and easy to understand.

          Some of the code here might be useful for HADOOP-3445 if you provide the capability to have a per-queue scheduler - then we could have fair scheduling for some of the queues (hopefully also including the default queue).

          At the same time, I want to add a few extensibility points to this scheduler. First of all, I want to add a way to extend the weight and deficit calculations, perhaps by providing multiple Adjuster classes that can be chained together. This could be used for example to boost priority of new jobs during their first few minutes (reducing response times for interactive queries), to take into account locality when deciding which job to assign to each slot, etc. Second, I've already implemented an interface called LoadManager that takes care of how many tasks should run on each TaskTracker. This currently uses the caps, but an alternate implementation that we might try is to assign caps based on load (start more tasks on nodes where the running tasks are not utilizing all of the CPU, bandwidth and memory). The scheduler is also pretty modular and it's easy to change or reuse particular components, like the JobSelectors.
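The chained-Adjuster idea above could look something like the following sketch. The Adjuster name comes from the comment, but this interface, the booster class, and the numbers are all illustrative, not the patch's API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical Adjuster interface: each link in the chain transforms
// the weight produced by the previous one.
interface Adjuster {
    double adjustWeight(double weight, long jobAgeMs);
}

// Example adjuster: boost jobs during their first few minutes to cut
// response time for short interactive queries.
class NewJobBooster implements Adjuster {
    private static final long BOOST_WINDOW_MS = 5 * 60 * 1000;
    public double adjustWeight(double weight, long jobAgeMs) {
        return jobAgeMs < BOOST_WINDOW_MS ? weight * 3.0 : weight;
    }
}

public class AdjusterChain {
    static double applyChain(List<Adjuster> chain, double baseWeight, long jobAgeMs) {
        double w = baseWeight;
        for (Adjuster a : chain) {
            w = a.adjustWeight(w, jobAgeMs);
        }
        return w;
    }

    public static void main(String[] args) {
        List<Adjuster> chain = Arrays.<Adjuster>asList(new NewJobBooster());
        System.out.println(applyChain(chain, 1.0, 60_000));  // within the boost window: 3.0
        System.out.println(applyChain(chain, 1.0, 600_000)); // past the window: 1.0
    }
}
```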

          Matei Zaharia added a comment -

          For those interested, here is a preliminary patch that implements the fair scheduler and works with the 9.2 patch for HADOOP-3412. It would work with the 9.1 patch also as long as you add type params to TaskTrackerManager.taskTrackers(). I've put the code in contrib but kept it inside the mapred package because it looks like HADOOP-3412 might not make it possible to write schedulers in other packages due to some types having package visibility.

          The only caveat with this patch is that after you build it, you have to set HADOOP_CLASSPATH to include build/contrib/poolscheduler/classes in hadoop-env.sh because the classes get placed there. Also, you must create a pool config file (say conf/pools) and set the jobconf variable "facebook.scheduler.allocation.file" to point to it. This file can be empty, or it can contain lines of the format <poolName> <minMappers> <minReducers> to specify minimum allocations for various pools.

          The patch supports fair sharing with weights by priority (every priority level gets 2x the weight of the previous level) and pools with minimum numbers of reducers and mappers. The pool config file is reloaded every 10 seconds so allocations can be changed while a cluster is running. The patch also includes some general tools that might be useful to schedulers, such as JobSelector and its subclasses, and the static methods in JobUtils.
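For example, an allocation file in the <poolName> <minMappers> <minReducers> format described above might look like this (the pool names and numbers are made up for illustration):

```
production 10 5
reporting   4 2
adhoc       0 0
```

With this file, the production pool is guaranteed at least 10 map slots and 5 reduce slots whenever it has runnable tasks, while adhoc jobs rely purely on fair sharing.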

          Vivek Ratan added a comment -

          Matei, sounds like you're extending the current Scheduler in some really beneficial ways. Look forward to seeing your patch.

          I just wanted to add: HADOOP-3445 supports user limits (a single user can only use up to a proportion of the queue), which, we think, will prevent starvation (when combined with capacities for queues). This feature should also solve HADOOP-2573.

          It'll be really interesting to see how these two schedulers handle user problems today, such as starvation of jobs, or fairness problems. I think there're a lot of similarities in the approaches, but perhaps enough differences that will manifest in different ways on a cluster. Hopefully, once we have patches out for both Jiras, and enough experience using them, we can find some common solutions to common problems with Hadoop scheduling. One of our goals, driven by the work for 3412, is to make the scheduler for 3421 extensible, so folks can tweak algorithms at different levels and find what works best for them, as well as enhance the default algorithm. I hope that the same is possible with your design - it'll be a very useful feature.

          My real point is, between HADOOP-3412, HADOOP-3421 (and its implementation offshoots, including HADOOP-3445), and HADOOP-3746, as well as other Jiras these might influence, there's a lot of good work going on now in dealing with Hadoop scheduling and looking at effective ways to improve the utilization and usability of Hadoop. It's about time! We need this effort.


            People

            • Assignee:
              Matei Zaharia
              Reporter:
              Matei Zaharia
            • Votes:
              0
              Watchers:
              23