Hadoop Map/Reduce
MAPREDUCE-706

Support for FIFO pools in the fair scheduler

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: contrib/fair-share
    • Labels:
      None
    • Release Note:
      Support for FIFO pools added to the Fair Scheduler.

      Description

      The fair scheduler should support making the internal scheduling algorithm for some pools be FIFO instead of fair sharing in order to work better for batch workloads. FIFO pools will behave exactly like the current default scheduler, sorting jobs by priority and then submission time. Pools will have their scheduling algorithm set through the pools config file, and it will be changeable at runtime.

      To support this feature, I'm also changing the internal logic of the fair scheduler to no longer use deficits. Instead, for fair sharing, we will assign tasks to the job farthest below its share as a ratio of its share. This is easier to combine with other scheduling algorithms and leads to a more stable sharing situation, avoiding unfairness issues brought up in MAPREDUCE-543 and MAPREDUCE-544 that happen when some jobs have long tasks. The new preemption (MAPREDUCE-551) will ensure that critical jobs can gain their fair share within a bounded amount of time.
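      As an illustrative sketch of the new deficit-free rule (the class and field names below are invented for illustration, not the fair scheduler's actual API): the next task goes to the job farthest below its fair share, measured as a ratio of that share.

```java
import java.util.*;

// Illustrative sketch only -- class and field names here are invented,
// not the fair scheduler's actual API. The next task is assigned to the
// job farthest below its fair share, as a ratio of that share.
class FairSharePolicy {
    static class Job {
        final String name;
        final double runningTasks, fairShare;
        Job(String name, double runningTasks, double fairShare) {
            this.name = name;
            this.runningTasks = runningTasks;
            this.fairShare = fairShare;
        }
    }

    // A lower runningTasks/fairShare ratio means the job is further
    // below its share, so it sorts first.
    static final Comparator<Job> BY_SHARE_RATIO =
        Comparator.comparingDouble(j -> j.runningTasks / j.fairShare);

    static String nextJob(List<Job> jobs) {
        return Collections.min(jobs, BY_SHARE_RATIO).name;
    }
}
```

      For example, a job running 2 tasks of a 10-slot share (ratio 0.2) would be scheduled before a job running 3 tasks of a 5-slot share (ratio 0.6).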

      1. mapreduce-706.v5.patch
        255 kB
        Matei Zaharia
      2. mapreduce-706.v4.patch
        229 kB
        Matei Zaharia
      3. mapreduce-706.v3.patch
        225 kB
        Matei Zaharia
      4. mapreduce-706.v2.patch
        221 kB
        Matei Zaharia
      5. mapreduce-706.v1.patch
        215 kB
        Matei Zaharia
      6. fsdesigndoc.tex
        25 kB
        Matei Zaharia
      7. fsdesigndoc.pdf
        113 kB
        Matei Zaharia
      8. mapreduce-706.patch
        217 kB
        Matei Zaharia

        Activity

        Tsz Wo Nicholas Sze added a comment -

        > ... With the PDF, I'm worried that many people won't have LaTeX installed, and it will be difficult to write an ant task to compile it, ...
        I agree. Let's leave the pdf file there for the moment. We may come up with a solution later.

        Matei Zaharia added a comment -

        I've created MAPREDUCE-878 to rename the design doc and add the license header. With the PDF, I'm worried that many people won't have LaTeX installed, and it will be difficult to write an ant task to compile it, but if having a PDF in there is a problem, I can remove it. It's only 120 KB in size though.

        Tsz Wo Nicholas Sze added a comment -

        The design doc is very nice (especially since it was typeset in tex)!

        Some suggestions for future work:

        • In hadoop, fs usually refers to FileSystem. "fsdesigndoc" sounds like a FileSystem design doc. I think we should avoid overloading the term "fs".
        • The tex file needs a license header.
        • We have not had pdf files under ./src before; "fsdesigndoc.pdf" is the first. I think the correct approach is to generate the pdf file by "ant docs". However, it may not be easy to do so.
        Matei Zaharia added a comment -

        The test failures are again in unrelated tests, so I've committed this. Thanks for the reviews, Aaron and Tom!

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12416467/mapreduce-706.v5.patch
        against trunk revision 803583.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 release audit. The applied patch generated 204 release audit warnings (more than the trunk's current 203 warnings).

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/472/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/472/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/472/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/472/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/472/console

        This message is automatically generated.

        Matei Zaharia added a comment -

        I'm going to wait for Hudson to run on this and then commit it if it looks okay.

        Matei Zaharia added a comment -

        Here's a new patch that includes the design docs.

        Matei Zaharia added a comment -

        I tested the patch on a 50-node EC2 cluster. I tried each of the features individually (FIFO pools, delay scheduling, preemption, etc). I also checked that CPU utilization on the master was low (~1%) even when 50 jobs are simultaneously running.

        Tom White added a comment -

        +1

        What testing did you carry out on this?

        Agree the documentation is excellent. Can you add it to the source tree?

        Matei Zaharia added a comment -

        Contrib test failures are again unrelated.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415798/mapreduce-706.v4.patch
        against trunk revision 801959.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/456/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/456/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/456/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/456/console

        This message is automatically generated.

        Matei Zaharia added a comment -

        I've made a few changes to the Web UI to show more / better info:

        • Pools now have their scheduling mode and fair shares displayed.
        • Jobs in a FIFO pool have their fair share displayed as NA instead of 0.

        Attaching a patch with these changes.

        Matei Zaharia added a comment -

        The test failures are in streaming and unrelated to this patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12415552/mapreduce-706.v3.patch
        against trunk revision 800693.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/443/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/443/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/443/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/443/console

        This message is automatically generated.

        Aaron Kimball added a comment -

        Sounds good!

        Matei Zaharia added a comment -

        Thanks for the review, Aaron! Here is a new patch taking into account your comments:

        TestFairScheduler.obtainNewReduceTask():
        Task task = new ReduceTask("", attemptId, 0, maps.length, 1) <-- shouldn't this be "reduces.length" ?

        This actually needs to be the number of maps according to the ReduceTask API; it lets it create a data structure for each map.

        TestFairScheduler.getLocalityLevel(): These locality level constants are used throughout the FairScheduler; they should be converted to an Enum. (Magic constants are evil.)

        Fixed. I added a LocalityLevel enum with methods for getting the locality level of a given task and converting a locality level to a cache level cap for obtainNewMapTask.
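        A minimal sketch of what such an enum could look like (the constant names and cap values below are my assumptions for illustration, not necessarily the patch's exact code):

```java
// Illustrative sketch of a LocalityLevel enum; the constant names and
// cache-level-cap values are assumptions, not necessarily the patch's
// exact code.
enum LocalityLevel {
    NODE, RACK, ANY;

    // Convert a locality level to the "cache level cap" passed to
    // obtainNewMapTask: node-local allows 1 level, rack-local 2,
    // and ANY places no cap.
    int toCacheLevelCap() {
        switch (this) {
            case NODE: return 1;
            case RACK: return 2;
            default:   return Integer.MAX_VALUE;
        }
    }
}
```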

        TestComputeFairShares.testEmptyList() - should this call verifyShares() after computeFairShares() to assert that the list length is zero?

        Fixed.

        PoolManager.parseSchedulingMode(): why case sensitive 'fifo' and 'fair' ? maybe use toLower() ?

        Fixed.

        PoolSchedulable c'tor: scheduler.getClock().getTime() should be called only once to guarantee this.lastTimeAtMinShare == this.lastTimeAtHalfFairShare on start?

        Fixed.

        assignTask(): Is SchedulingMode guaranteed to never be extended by another internal algorithm? If not, turn "else" into "else if" and have an "else throw InvalidArgumentException" at the end of the case.

        Fixed. Throws RuntimeException if there's another mode, which should cause the JobTracker to exit and make it obvious that there's something deeply wrong.

        JobSchedulable.updateDemand(): why does this use System.currentTimeMillis() instead of getting the time from a Clock object?

        Fixed.

        Schedulable's class javadoc: typo "algoirthms"

        Fixed.

        SchedulingAlgorithms.LOG: rather than use a string, use SchedulingAlgorithms.class.getName()

        Fixed. Also fixed this in PoolSchedulable.

        FairScheduler.UpdateThread.run(): why is preemptTasksIfNecessary() commented out? Needs a comment for rationale.

        Oops! Fixed. I had commented that out for debugging.

        FairScheduler.assignTasks() - Should convert System.out.println to log msg.

        Removed the println, it was also for debugging.

        This method is also getting pretty long. Consider refactoring the inner loop into shorter methods if you need to add anything else to it in the future.

        I've refactored it to take out the delay scheduling code and the cap calculation code. It's now closer to fitting on one screenful.

        getAllowedLocalityLevel():
        You have the comment: // Job not in infos (shouldn't happen)-
        ... So throw an exception if it does, or at least log this event with level ERROR

        Fixed. I log an error so the scheduler doesn't crash if for some reason a bug does cause a job to go through there when it has no info set.

        In addition to these fixes, I've added a bit more documentation on the delay scheduling methods in this version of the patch, and I've changed Schedulable.assignTasks to take the current time as a parameter so that we don't have multiple calls to System.currentTimeMillis on each heartbeat (in case that hurts performance).

        Aaron Kimball added a comment -

        Matei,

        Great documentation – that really helps! Also good that you added a lot of tests. +1 overall on this patch, subject to the following (relatively minor) questions and suggestions:

        TestFairScheduler.obtainNewReduceTask():
        Task task = new ReduceTask("", attemptId, 0, maps.length, 1) <-- shouldn't this be "reduces.length" ?

        TestFairScheduler.getLocalityLevel(): These locality level constants are used throughout the FairScheduler; they should be converted to an Enum. (Magic constants are evil.)

        TestComputeFairShares.testEmptyList() – should this call verifyShares() after computeFairShares() to assert that the list length is zero?

        PoolManager.parseSchedulingMode(): why case sensitive 'fifo' and 'fair' ? maybe use toLower() ?

        PoolSchedulable c'tor: scheduler.getClock().getTime() should be called only once to guarantee this.lastTimeAtMinShare == this.lastTimeAtHalfFairShare on start?

        assignTask(): Is SchedulingMode guaranteed to never be extended by another internal algorithm? If not, turn "else" into "else if" and have an "else throw InvalidArgumentException" at the end of the case.

        JobSchedulable.updateDemand(): why does this use System.currentTimeMillis() instead of getting the time from a Clock object?

        Schedulable's class javadoc: typo "algoirthms"

        SchedulingAlgorithms.LOG: rather than use a string, use SchedulingAlgorithms.class.getName()

        FairScheduler.UpdateThread.run(): why is preemptTasksIfNecessary() commented out? Needs a comment for rationale.

        FairScheduler.assignTasks() – Should convert System.out.println to log msg.

        This method is also getting pretty long. Consider refactoring the inner loop into shorter methods if you need to add anything else to it in the future.

        getAllowedLocalityLevel():
        You have the comment: // Job not in infos (shouldn't happen)-
        ... So throw an exception if it does, or at least log this event with level ERROR, rather than returning an in-bounds value? When you get to switch(info.lastMapLocalityLevel), you'll naturally throw an NPE, so the caller should just deal with that and clean up its own mess.

        Matei Zaharia added a comment -

        I've fixed the release audit warnings by adding Apache license headers to the files in question. The contrib test failures are unrelated to this patch.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12414820/mapreduce-706.v1.patch
        against trunk revision 799126.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 9 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        -1 release audit. The applied patch generated 210 release audit warnings (more than the trunk's current 203 warnings).

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/427/testReport/
        Release audit warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/427/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/427/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/427/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/427/console

        This message is automatically generated.

        Matei Zaharia added a comment -

        Updated patch for current trunk.

        Matei Zaharia added a comment -

        I've attached a design document explaining the new organization of the code and the fair scheduler design in general.

        Matei Zaharia added a comment -

        Here's a patch adding this feature. The patch includes MAPREDUCE-548 because it depends on that issue.

        This patch performs a fairly major restructuring of the fair scheduler to support different scheduling algorithms within each pool. However, the resulting code is simpler, more flexible and more efficient than the current fair scheduler. Many of the changes in the patch are also just changes to unit tests, which were necessary because deficits were removed (affecting behavior slightly in some tests) and the names of some internal data structures changed.

        At a high level, the patch changes the scheduler to perform hierarchical scheduling, where a pool is chosen to assign a task to, and then the pool is asked to choose a task from among its jobs. The pool can choose to use either FIFO or fair sharing to do this.

        To allow the same fair sharing code to be used across pools and across jobs, the patch introduces a new abstract class called Schedulable that can represent both a job (as JobSchedulable) and a pool (as PoolSchedulable). Schedulables can be asked to assign tasks and can be queried for metrics used in various scheduling algorithms, such as current number of running tasks, demand (number of tasks required), weight, fair share, priority, etc. PoolSchedulables aggregate the metrics from the jobs they contain. There are separate sets of Schedulables for maps and reduces, to let the same code be reused for both task types.
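The hierarchy described above can be sketched as follows. This is a simplified illustration under assumed method names and with only two metrics (running tasks and demand); the patch's actual classes carry more state and interact with real TaskTrackers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the Schedulable abstraction: a pool aggregates the
// metrics of its jobs and delegates task choice to its own policy.
abstract class Schedulable {
    abstract int getRunningTasks();
    abstract int getDemand();
    // Launch one task if possible; returns a task name or null.
    abstract String assignTask();
}

class JobSchedulable extends Schedulable {
    final String job;
    final int demand;
    int running;

    JobSchedulable(String job, int demand) {
        this.job = job;
        this.demand = demand;
    }

    int getRunningTasks() { return running; }
    int getDemand() { return demand; }

    String assignTask() {
        if (running >= demand) return null;  // no more work to launch
        running++;
        return job + "-task" + running;
    }
}

class PoolSchedulable extends Schedulable {
    final List<JobSchedulable> jobs = new ArrayList<>();
    final Comparator<Schedulable> policy;  // FIFO or fair-share ordering

    PoolSchedulable(Comparator<Schedulable> policy) { this.policy = policy; }

    // Metrics are aggregated over the contained jobs.
    int getRunningTasks() {
        int sum = 0;
        for (JobSchedulable j : jobs) sum += j.getRunningTasks();
        return sum;
    }
    int getDemand() {
        int sum = 0;
        for (JobSchedulable j : jobs) sum += j.getDemand();
        return sum;
    }

    // Second level of the hierarchy: offer the slot to jobs in the order
    // given by this pool's policy.
    String assignTask() {
        List<JobSchedulable> sorted = new ArrayList<>(jobs);
        sorted.sort(policy);
        for (JobSchedulable j : sorted) {
            String t = j.assignTask();
            if (t != null) return t;
        }
        return null;
    }
}

class HierarchicalScheduler {
    // First level: order the pools (always by fair sharing), then ask
    // each in turn for a task.
    static String assignTask(List<PoolSchedulable> pools,
                             Comparator<Schedulable> interPoolPolicy) {
        List<PoolSchedulable> sorted = new ArrayList<>(pools);
        sorted.sort(interPoolPolicy);
        for (PoolSchedulable p : sorted) {
            String t = p.assignTask();
            if (t != null) return t;
        }
        return null;
    }
}
```

Because maps and reduces get separate sets of Schedulables, the same two-level loop serves both task types.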

        Apart from this large-scale change, the patch includes a few smaller but important changes that simplify the fair scheduler and let existing features work with the new hierarchical model:

        • Deficits are no longer used for fair sharing. Instead, we just assign tasks to the job/pool with the fewest running tasks (scaled by weight). The scheduler originally used deficits because they work well in CPU and packet scheduling and because they let jobs "catch up" after temporary unfairness, but we found that this can lead to bad behavior in a system with long tasks like Hadoop (MAPREDUCE-544). In addition, preemption (MAPREDUCE-551) can now be used to ensure jobs aren't starved of their share for too long a time. Deficits also don't make sense with FIFO, so it would have been difficult to use the existing code there. Therefore this patch removes deficits and moves to a simpler form of fair sharing. The new algorithm should also let small jobs start faster, without having to wait to accumulate a deficit.
        • A new algorithm based on binary search is used for computing fair shares. Computing weighted fair shares in the presence of minimum shares and maximum shares (due to jobs whose demand is less than their share) is fairly tricky, and the previous algorithm ignored the low-demand case and ran in quadratic time in the number of jobs. The new algorithm handles these cases, runs in linear time in the number of jobs, and is only about 60 lines of code.
        • Preemption now happens at a pool level and not at a job level. Up until now the fair scheduler has contained only per-job data structures, so each job had a minimum share assigned (a fraction of its pool) and kept track of whether it was starved. This could lead to fewer tasks than intended being preempted if, for example, jobs' min shares rounded down to less than the share of the pool. Per-job minimum shares also can't easily be assigned in FIFO pools. Instead, the new code keeps track of min shares and starvation at the level of pools. This leads to incompatible behavior with MAPREDUCE-551, but that code has not been in any release yet so it should be fine to change it here.
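The binary-search share computation in the second bullet above can be sketched as follows. This is an illustration under assumed field names (weight, minShare, demand) and an assumed iteration count, with positive weights; the patch's actual signatures may differ. The share of a schedulable at a "ratio" r is min(demand, max(minShare, r * weight)), which is monotone in r, so we can search for the r at which the shares consume all slots.

```java
import java.util.List;

public class ComputeFairShares {
    // Hypothetical schedulable record for illustration.
    static class Sched {
        final double weight, minShare, demand;
        double fairShare;
        Sched(double weight, double minShare, double demand) {
            this.weight = weight;
            this.minShare = minShare;
            this.demand = demand;
        }
    }

    // Share at ratio r: proportional to weight, but at least the min
    // share and at most the demand.
    static double shareAt(Sched s, double r) {
        return Math.min(s.demand, Math.max(s.minShare, r * s.weight));
    }

    static double totalAt(List<Sched> scheds, double r) {
        double sum = 0;
        for (Sched s : scheds) sum += shareAt(s, r);
        return sum;
    }

    static void computeShares(List<Sched> scheds, double totalSlots) {
        // If the total demand fits in the cluster, every schedulable
        // simply gets its demand.
        double totalDemand = 0;
        for (Sched s : scheds) totalDemand += s.demand;
        double target = Math.min(totalSlots, totalDemand);

        // Find an upper bound on r by doubling (terminates because
        // weights are assumed positive), then bisect; 50 halvings give
        // ample precision for slot counts.
        double hi = 1.0;
        while (totalAt(scheds, hi) < target) hi *= 2;
        double lo = 0.0;
        for (int i = 0; i < 50; i++) {
            double mid = (lo + hi) / 2;
            if (totalAt(scheds, mid) < target) lo = mid; else hi = mid;
        }
        for (Sched s : scheds) s.fairShare = shareAt(s, hi);
    }
}
```

For example, two jobs of weight 1 with demands 2 and 10 sharing 8 slots settle at r = 6, giving shares of 2 and 6: the low-demand job is capped at its demand and the surplus flows to the other job.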

        The new fair scheduler structure should make it straightforward to support features such as maximum shares for pools (MAPREDUCE-698), new scheduling algorithms within pools (e.g. shortest job first), and even sub-pools within pools (e.g. an organization pool can contain a sub-pool for each user). Over the next few days, I'll also post a design document detailing the new scheduler structure and algorithms.


          People

          • Assignee: Matei Zaharia
          • Reporter: Matei Zaharia
          • Votes: 0
          • Watchers: 14
