Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: contrib/fair-share
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added support for preemption in the fair scheduler. The new configuration options for enabling this are described in the fair scheduler documentation.

      Description

      Task preemption is necessary in a multi-user Hadoop cluster for two reasons: users might submit long-running tasks by mistake (e.g. an infinite loop in a map program), or tasks may be long simply because they process large amounts of data. The Fair Scheduler (HADOOP-3746) has a concept of guaranteed capacity for certain queues, as well as a goal of providing good performance for interactive jobs on average through fair sharing. Therefore, it will support preemption under two conditions:
      1) A job isn't getting its guaranteed share of the cluster for at least T1 seconds.
      2) A job is getting significantly less than its fair share for T2 seconds (e.g. less than half its share).

      T1 will be chosen smaller than T2 (and will be configurable per queue) to meet guarantees quickly. T2 is meant as a last resort in case non-critical jobs in queues with no guaranteed capacity are being starved.

      When deciding which tasks to kill to make room for the job, we will use the following heuristics:

      • Look for tasks to kill only in jobs that have more than their fair share, ordering these by deficit (most overscheduled jobs first).
      • For maps: kill tasks that have run for the least amount of time (limiting wasted time).
      • For reduces: similar to maps, but give extra preference to reduces in the copy phase where there is not much map output per task (at Facebook, we have observed this to be the main time we need preemption: when a job has a long map phase and its reducers are mostly sitting idle, filling up slots).
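The victim-selection heuristics above can be sketched as follows. This is a hedged illustration only: every class, field, and method name here (Job, surplus, selectVictims) is invented for the example and is not the actual FairScheduler API.

```java
import java.util.*;

// Sketch of the victim-selection heuristic from the description.
// All names here are illustrative, not the real FairScheduler API.
public class PreemptionSketch {
    static class Job {
        final String name;
        final double fairShare;    // slots this job is entitled to
        final long[] taskRuntimes; // runtime (ms) of each running task
        Job(String name, double fairShare, long[] taskRuntimes) {
            this.name = name;
            this.fairShare = fairShare;
            this.taskRuntimes = taskRuntimes;
        }
        // How far above its fair share the job currently is.
        double surplus() { return taskRuntimes.length - fairShare; }
    }

    // Pick tasks to kill: only from jobs above their fair share,
    // most-overscheduled jobs first, shortest-running tasks first
    // (killing young tasks limits wasted work).
    static List<Long> selectVictims(List<Job> jobs, int slotsNeeded) {
        List<Long> victims = new ArrayList<>();
        jobs.stream()
            .filter(j -> j.surplus() > 0)
            .sorted(Comparator.comparingDouble(Job::surplus).reversed())
            .forEach(j -> {
                long[] times = j.taskRuntimes.clone();
                Arrays.sort(times); // youngest (shortest-running) first
                for (long t : times) {
                    if (victims.size() < slotsNeeded) victims.add(t);
                }
            });
        return victims;
    }

    public static void main(String[] args) {
        Job a = new Job("a", 1.0, new long[]{50, 10, 400}); // 2 over share
        Job b = new Job("b", 3.0, new long[]{1});           // under share: exempt
        // Kills the two shortest-running tasks of the most-overscheduled job:
        System.out.println(selectVictims(Arrays.asList(a, b), 2)); // → [10, 50]
    }
}
```

Jobs at or under their fair share are never touched, so preemption only reshuffles surplus slots.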
      1. mapreduce-551-branch20.txt
        127 kB
        Todd Lipcon
      2. hadoop-4665-v7e.patch
        132 kB
        Matei Zaharia
      3. hadoop-4665-v7d.patch
        131 kB
        Matei Zaharia
      4. hadoop-4665-v7c.patch
        131 kB
        Matei Zaharia
      5. hadoop-4665-v7b.patch
        130 kB
        Matei Zaharia
      6. hadoop-4665-v7.patch
        130 kB
        Matei Zaharia
      7. hadoop-4665-v6.patch
        98 kB
        Matei Zaharia
      8. hadoop-4665-v5.patch
        98 kB
        Matei Zaharia
      9. hadoop-4665-v4.patch
        44 kB
        Matei Zaharia
      10. hadoop-4665-v3.patch
        44 kB
        Matei Zaharia
      11. hadoop-4665-v2.patch
        44 kB
        Matei Zaharia
      12. hadoop-4665-v1b.patch
        45 kB
        Matei Zaharia
      13. hadoop-4665-v1.patch
        45 kB
        Matei Zaharia
      14. fs-preemption-v0.patch
        57 kB
        Matei Zaharia
      15. fairshare-patches.tar.gz
        31 kB
        Todd Lipcon

        Issue Links

          Activity

          Todd Lipcon added a comment -

          The branch-20 patch previously posted here had a backport error. There were also two other bugs (MAPREDUCE-1070 and MAPREDUCE-1089) found in fairshare since that patch. Here's a tarball with a working patch series against the 0.20.1 release in case anyone is interested in applying it on their own (this should not go into 0.20.2).

          dhruba borthakur added a comment -

          Thanks Todd, we might give it a whirl with 0.20.

          Todd Lipcon added a comment -

          Attaching a patch against branch-20 in case anyone finds this useful. (this should not be slated for inclusion in 0.20.1 since it is a new feature – just attaching for those who would like to patch it in)

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/)

          Matei Zaharia added a comment -

          Thanks Vinod. I've committed this. I had to do a little more merging due to a change in TaskTrackerManager and EagerTaskInitializationListener.

          Vinod Kumar Vavilapalli added a comment -

          I'll commit it if those pass.

          +1. Please commit it before some other patch goes in and breaks this patch again

          Matei Zaharia added a comment -

          I've updated the patch to work with the changes in MAPREDUCE-516. The updates were fairly minor. The docs and the fair scheduler's tests still run successfully. I'm currently running the other tests through ant test. I'll commit it if those pass.

          Vinod Kumar Vavilapalli added a comment -

          Sigh. I was hoping that you would commit the patch after generating the new one. By the time you read this, MAPREDUCE-516 will most likely have been committed. This patch will need an update after MAPREDUCE-516.

          Matei Zaharia added a comment -

          I ran ant test and all the tests passed except for TestReduceFetch, which is a known problem due to MAPREDUCE-433. I also ran ant docs and successfully built the documentation.

          Matei Zaharia added a comment -

          Here's a patch against the mapreduce SVN. I still have to run all the tests, but if they work, can I commit this, Vinod?

          Vinod Kumar Vavilapalli added a comment -

          Reviewed the latest changes and the test cases too; they look good. Quickly looked at the documentation as well, but couldn't run ant docs because of workspace problems. Hudson seems to be stuck again, so please run "ant test" to make sure everything is fine and that the docs build properly.

          The changes are committable, but you need to make minor changes to the patch w.r.t. directory structure to reflect the latest project split. Can you please do so?

          Matei Zaharia added a comment -

          Nicholas, that is exactly the case this patch fixes. The unit tests cover it.

          Tsz Wo Nicholas Sze added a comment -

          It would be great if we can have a unit test testing the case given in HADOOP-5701. Not sure if you already have it. I did not check the patch.

          Matei Zaharia added a comment -

          I've attached a new patch taking into account Vinod's two comments and keeping up with trunk. Vinod, do you have a rough ETA on when you'll finish looking at this?

          Matei Zaharia added a comment -

          Hi Vinod,

          Sorry for not addressing the half fair share point earlier, it looks like I forgot to post that. The difference between isStarvedForFairShare and tasksDueToFairShare is intentional. I want the threshold for triggering fair share preemption to be a lot lower than the fair share so that it doesn't trigger unless something is going horribly wrong in the cluster. The reason is that in standard use of the fair scheduler, we expect any critical ("production") jobs to have min shares set, which are enforced much more precisely (and potentially with a smaller timeout). The fair share is for those jobs that are not critical and that we're okay with being a little unfair to if that reduces wasted work. So the service model is that we only do preemption if the job is being starved very badly. However, when we do preemption, we do bring you up to your full fair share, because at that point it's clear that you've been starved badly for a long time. Once you are at your full fair share, it will be easy for you to remain there as you'll be given chances to reuse those slots when your tasks finish. If some users request for a stricter enforcement of fair shares, we can make the "half" part configurable later, but we decided this model is a good way to prevent unnecessary preemption and exchange of slots back and forth between jobs, while also not being too unfair.

          I'll make a patch with the other changes sometime in the next few days or maybe after I see some of your comments.

          The changes in test cases and docs are indeed huge. The request was huge (and important), and I took this opportunity to clean up the fair scheduler docs overall.
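The trigger-vs-restore asymmetry described above can be sketched as follows. Method names are taken from this discussion (isStarvedForFairShare, tasksDueToFairShare), but the bodies are an illustrative guess at the semantics being debated, not the actual FairScheduler code.

```java
// Sketch of the intentional asymmetry discussed above: preemption
// triggers only below HALF the fair share, but once triggered it
// restores the job to its FULL fair share.
public class FairShareSketch {
    // Trigger: preempt only when the job is well below its fair share,
    // so preemption remains a last resort for badly starved jobs.
    static boolean isStarvedForFairShare(int runningTasks, double fairShare) {
        return runningTasks < fairShare / 2;
    }
    // Restore: reclaim enough slots to reach the full share, since at
    // this point the job has clearly been starved for a long time.
    static int tasksDueToFairShare(int runningTasks, double fairShare) {
        return Math.max(0, (int) Math.floor(fairShare) - runningTasks);
    }
    public static void main(String[] args) {
        System.out.println(isStarvedForFairShare(6, 10.0)); // → false (above half)
        System.out.println(isStarvedForFairShare(3, 10.0)); // → true (below half)
        System.out.println(tasksDueToFairShare(3, 10.0));   // → 7 (up to full share)
    }
}
```

The gap between the half-share trigger and the full-share restore is what prevents slots from ping-ponging between jobs hovering near their shares.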

          Vinod Kumar Vavilapalli added a comment -

          Matei, I have (re)started looking at the patch. The changes look good overall, except the following points:

          • preemptionInterval variable is initialized to 30000 whereas the default value is 15000. Shouldn't they be consistent?
          • EagerTaskInitializationListener is not removed from the list of listeners in the terminate method
          • You seem to have missed one of my earlier points:

            The count tasksDueToFairShare seems to be calculated to see if full fair-share of slots are allotted or not instead of the advertised half of fairshare. I think this is a mistake as isStarvedForFairShare() is checking for half of fair-share. Or am I missing something?

          The changes in test-cases and documentation seem to be huge. It'll take till tomorrow for me to complete the review of test-cases and documentation. Thanks for the patience.

          Matei Zaharia added a comment -

          Thanks, that makes sense. I added the license to the .java file.

          Tom White added a comment -

          These files are missing Apache license headers. Config files in Hadoop don't seem to include a header, so you can probably skip the one for fair-scheduler.xml.

          Matei Zaharia added a comment -

          The test failure seems to be unrelated to my patch (HADOOP-5869).

          I'm not sure what the release audit warning means. This is what it says:

          5d4
          <      [java]  !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-781343_HADOOP-4665_PATCH-12409635/conf/fair-scheduler.xml
          300d298
          <      [java]  !????? /home/hudson/hudson-slave/workspace/Hadoop-Patch-vesta.apache.org/trunk/build/hadoop-781343_HADOOP-4665_PATCH-12409635/src/contrib/fairscheduler/src/java/org/apache/hadoop/mapred/FairSchedulerEventLog.java
          
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12409635/hadoop-4665-v7.patch
          against trunk revision 781343.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          -1 release audit. The applied patch generated 494 release audit warnings (more than the trunk's current 492 warnings).

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/457/console

          This message is automatically generated.

          Matei Zaharia added a comment -

          Attaching a new patch which includes updated documentation.

          Matei Zaharia added a comment -

          Updated patch to keep up with trunk. Would appreciate a review.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12407422/hadoop-4665-v5.patch
          against trunk revision 772960.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/310/console

          This message is automatically generated.

          Matei Zaharia added a comment -

          Here's a new patch. It incorporates Vinod's comments, except for changing the event log format to key-value pairs. The event log in its current state is meant to be used only for debugging and has a relatively simple tab-separated format, so I didn't want to complicate the API to it. We can have another JIRA for adding a parser class for it and turning it into more than a debug tool if there's demand for that. It also adds five unit tests for preemption and a default config file for the fair scheduler. The patch also makes the update interval configurable (since I did that with the other two periodic check intervals).

          Included in this patch is a fairly significant evolution of the fair scheduler unit testing framework, which adds tracking of tasks in FakeJobInProgress to allow for preemption to be tested meaningfully. The FakeJobInProgress and FakeTaskInProgress have some commonalities with the ones in the capacity scheduler, but unfortunately I wasn't able to use those directly because some of the classes used in the fair scheduler, such as Clock, don't have equivalents there. It would be nice to create a common testing framework for schedulers, but that should be a separate JIRA. I also think that the ultimate solution for that is not to make an elaborate FakeJobInProgress and related classes, but rather to make MiniMRCluster more user-friendly and switch all tests into it. We can also consider making the Clock class be used by all the MR code so that tests can run at an accelerated rate on these simulated clusters.

          One other thing I will need to add in this patch is documentation for the preemption params. However, the other changes can be reviewed right now.

          Chris Douglas added a comment -

          The current patch no longer applies to trunk

          Vinod Kumar Vavilapalli added a comment -

          Agree with Hemanth on having two knobs - one for completely disabling preemption and one for enabling a dry run.

          Regarding leaving some heartbeats before killing tasks, I agree with your arguments and am fine with leaving that for now as the timeouts may indeed turn out to be in the order of minutes like you are saying. If it becomes a need later, we can do it in future in another JIRA issue.

          I'm +1 to the rest of the arguments. So I'll be fine with the patch once the rest of the code comments are incorporated along with a knob for disabling preemption altogether.

          Hemanth Yamijala added a comment -

          Matei, I've not been looking at this code much, but based on the discussion, I have only one comment: regarding the turning off of pre-emption.

          Your use case of organizations wanting to try out with pre-emption disabled, but still seeing when pre-emption would happen seems to me like a dry-run mode that you can see in utilities like an RPM update. As you've explained, looks like there are use cases for this.

          From our work with the capacity scheduler, we've found there are environments where pre-emption is indeed not necessary. Even when it exists, it has proved to be a complex feature to reason about. From this perspective, it seems like it may make sense to provide an option to completely turn it off and have reasonable confidence that nothing related to pre-emption would be in effect, including any additional computation that it requires.

          Hence, my suggestion is the following: have a flag to truly turn off pre-emption and have a variable that allows a dry-run of pre-emption when it is enabled. I believe this may not be a very difficult change? (Indeed, I've been thinking of cases where a dry-run of the entire scheduling logic makes sense - e.g. to get a 'scheduling log' that can be replayed.)

          The flip side of my proposal is an additional configuration option. But depending on what we think the right defaults are, we can still make the configuration easy for end users, no? To that extent, your arguments about the proposed default values are fine with me.

          Matei Zaharia added a comment -

          Vinod, have you had a chance to look at my arguments above? I'd really like to get this patch committed soon because I have others that depend on it.

          Matei Zaharia added a comment - edited

          Hi Vinod,

          A few comments:

          • Is the 2-3 heartbeats for preemption really necessary? I imagine timeouts will be on the order of minutes, so a few seconds won't make a big difference. Although thinking of the timeout as an SLA is nice, it's also equally easy to think of it as "this is when it can start killing tasks". To me, putting this extra 2-3 heartbeats thing in seems like unnecessary complexity.
          • The reason the preemption-enabled check is deep inside the method is to give you the ability to turn preemption off but see the SHOULD_PREEMPT log messages to figure out when your cluster would preempt tasks given certain settings. We wanted this at Facebook so that we can add some timeouts, count the SHOULD_PREEMPT messages over a week, and be sure that the settings chosen are good without actually losing a lot of tasks if there's a mistake. I think this is a good feature to keep for other people who are thinking of turning on preemption.
          • There actually is a way to set a default preemption timeout as in Joydeep's comment - you can set defaultMinSharePreemptionTimeout in the XML file. The code for this is in PoolManager.
          • The default settings of preemptionEnabled=true and no timeouts are to make preemption easy to configure gradually. We expect that most people will start out not wanting preemption, because it creates an extra worry of "have we set it too low". Then as people start running pools with "production" jobs (with min share set), they may want to enable preemption just for these jobs. They would be able to do that by just adding a preemptionTimeout entry to those pools in the config, and it would be active without needing to restart the JobTracker. Then if they see non-production jobs suffering, they could enable the fairSharePreemptionTimeout, again without requiring a cluster restart. The only reason to also provide a preemptionEnabled setting in the jobconf is for the testing purpose I mentioned above, where an organization switching over to preemption in production can figure out first whether it will kill too many tasks. Overall, my goal with all the fair scheduler config is to make it as easy as possible to use "out of the box". You don't need to define pools in advance, you don't need to define min shares or weights in advance, you don't need to decide when to use preemption in advance, etc, and the only setting you need in mapred-site.xml is the one that sets Hadoop to use the fair scheduler. Then as you decide you want the more advanced features, you enable them gradually. I actually think there are strong advantages to this over your proposal of having preemptionEnabled=false and having non-infinite default timeouts so again I'd like more motivation before making such a large code change. The other factor is that Facebook has been using the current version of the preemption code and found the current features useful.

          I'll take a look at your other comments later this week. Regarding the code reuse in preemptTasks, it is indeed based on the one in the capacity scheduler, but I'd like to make refactoring that a separate issue from this JIRA. The right thing might be to have some of that functionality in TaskScheduler.
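The gradual-configuration story Matei describes above maps to entries in the fair scheduler allocations XML file. As a rough illustration (the element names below are the ones named in this thread - defaultMinSharePreemptionTimeout, fairSharePreemptionTimeout, and the per-pool preemptionTimeout tag - but the pool name and values are made up):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Fall-back min share preemption timeout (seconds) for pools that
       do not set their own; handled in PoolManager per the comment above. -->
  <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
  <!-- Cluster-wide fair share preemption timeout (seconds), the
       "last resort" for jobs without guaranteed capacity. -->
  <fairSharePreemptionTimeout>1200</fairSharePreemptionTimeout>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <!-- The patch under review calls this tag preemptionTimeout;
         the review below suggests renaming it minSharePreemptionTimeout. -->
    <preemptionTimeout>300</preemptionTimeout>
  </pool>
</allocations>
```

Because the allocations file is reloaded at runtime, timeouts added this way take effect without restarting the JobTracker, which is the incremental-enablement workflow described above.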

          Vinod Kumar Vavilapalli added a comment -

          Oh, and Joydeep's previous comment isn't addressed yet.

          default preemption timeout(s) for pools. right now unless we configure on a per pool basis - the default timeout is infinite.

          Because we already have a different plug for disabling preemption altogether, I think this default timeout as well as fairSharePreemptionTimeout should have some finite value.

          Vinod Kumar Vavilapalli added a comment -

          Some more miscellaneous points:

          • With the introduction of new timeouts, I think the importance of a template file for the allocations increases. I remember you saying something about it on some jira. Have you filed one?
          • This patch adds FairSchedulerEventLog for logging various events in the scheduler in machine-readable format. But there is no place from where utilities can determine the format of the log records: How should we track the event log records' format, add some schema file? Or alter the logs to be a list of key-value pairs similar to JobHistory instead of just values?
          • There is a lot of common code between

            int org.apache.hadoop.mapred.FairScheduler.preemptTasks(JobInProgress job, TaskType type, int maxToPreempt)

            and

            int org.apache.hadoop.mapred.CapacityTaskScheduler.MapSchedulingMgr.killTasksFromJob(JobInProgress job, int tasksToKill)

            In fact most of it is the same. I think we should somehow try to refactor this common code. I don't know if we want to do it in this jira itself or not.

          Vinod Kumar Vavilapalli added a comment -

          Here's a more comprehensive review:

          Major points:

          • As it's a new feature, I think preemption (mapred.fairscheduler.preemption) should be disabled (false) by default.
          • Can we do something like the proposal stated above: preempt a task when a job's fair share/min share is not met within PREEMPTION_TIMEOUT, leaving 2-3 heartbeats for the task to actually get killed.
          • DUMP_INTERVAL and PREEMPTION_INTERVAL should be configurable. The variables themselves can be package-private instead of public.
          • The class FairSchedulerEventLog can just be package-private. So do all the methods inside - init, log, shutdown and isEnabled - they don't need to be public as of now.
          • FairScheduler.preemptTasksIfNecessary() method:
            • This method does a Collections.reverse(jobs) after sorting the jobs. We can just traverse the list in reverse order to get the desired effect here.
            • The check as to whether FairScheduler will do preemption (preemptionEnabled && !useFifo) is done deep inside - all the stats are calculated and only then is preemption skipped if not needed. Can we move this check, maybe, to the beginning of the preemptTasksIfNecessary() method or inside the update() method itself?
          • FairScheduler.tasksToPreempt() method:
            • The count tasksDueToFairShare seems to be calculated to see whether the full fair share of slots is allotted or not, instead of the advertised half of the fair share. I think this is a mistake, as isStarvedForFairShare() is checking for half of the fair share. Or am I missing something?
            • EventLogs in this method are a bit confusing if the job is short on both minshare and fairshare. In this case, it is roughly giving an impression that we are preempting twice. I think it would be clear to just log that the job is short by so many slots w.r.t minshare and w.r.t fairshare. And in the end, just before returning we can simply say the exact number of slots we are going to preempt.
          • FairScheduler.start() method:
            • I think that the call loadMgr.setEventLog(eventLog) should come after eventLog is initialized with a new FairSchedulerEventLog object.
          • TestFairScheduler.java:
            • You have added setup and cleanup task creation code in the initTasks() method. Is there any specific reason for doing this? In any case, much of it duplicates code from JobInProgress.initTasks(); if we really want, we can refactor this code into a new method, say JobInProgress.createSpecialTasks().
            • There are no new test-cases related to preemption. I think we should have one/some.

          Minor nits:

          • Javadoc for getFairSharePreemptionTimeout() is incomplete. Also, if I am understanding correctly FairSharePreemptionTimeout is the same for all pools/jobs.
          • The xml tag for minSharePreemptionTimeouts is currently preemptionTimeout. It can better be minSharePreemptionTimeout.
          • update thread: The log message "Failed to update fair share calculations" could better be "Exception in update thread". This is because the update thread now does more than just updating calculations.
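To make the half-of-fair-share question above concrete, here is a minimal sketch of the two starvation conditions being debated. The method names echo isStarvedForFairShare() from the patch, but the signatures and field choices here are illustrative paraphrases of the discussion, not the patch's actual API:

```java
// Sketch of the two starvation conditions discussed in this review.
// Names and signatures are illustrative, not copied from the patch.
public class StarvationSketch {

  /** Min-share starvation: the job has held fewer than minShare slots
      for longer than its min share preemption timeout. */
  public static boolean isStarvedForMinShare(
      int runningTasks, int minShare, long starvedMillis, long minShareTimeout) {
    return runningTasks < minShare && starvedMillis >= minShareTimeout;
  }

  /** The advertised fair-share condition: the job must be below HALF
      of its fair share (not the full share) before fair-share
      preemption kicks in - the point of the review comment above. */
  public static boolean isStarvedForFairShare(
      int runningTasks, double fairShare, long starvedMillis, long fairShareTimeout) {
    return runningTasks < fairShare / 2 && starvedMillis >= fairShareTimeout;
  }
}
```

The review comment is that tasksToPreempt() appears to count slots up to the full fair share, while this gating predicate only fires below half of it, so the two thresholds disagree.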
          Vinod Kumar Vavilapalli added a comment -

          Sorry for the late reply, was on a long weekend.

          Your explanation w.r.t the min-shares of pools clarified my doubts. Thanks!

          I have discussed the overall approach of this patch with Hemanth, Sreekanth and Rahul offline. We concluded on one slight improvement - the preemption patch of the capacity scheduler treats the preemption timeouts as a kind of SLA for the pool/queue, and so leaves a couple of heartbeats for slots to become free after it preempts a task. Can we do something like that here? Essentially, the proposal is to preempt a task when a job's fair share/min share is not met within PREEMPTION_TIMEOUT minus 2-3 heartbeats.

          Few other code comments I have:

          • Can we make PREEMPTION_INTERVAL configurable?
          • The check as to whether FairScheduler will do preemption (preemptionEnabled && !useFifo) is done deep inside - all the stats are calculated and only then is preemption skipped if not needed. Can we move this check, maybe, to the beginning of the preemptTasksIfNecessary() method or inside the update() method itself?
          • The class FairSchedulerEventLog can just be package-private. So do all the methods inside - init, log, shutdown and isEnabled - they don't need to be public as of now.

          Again, sorry for the late reply. Appreciate your patience.

          Matei Zaharia added a comment -

          The test failures seem to be unrelated to my patch - they happen in unmodified trunk too. One of them is HADOOP-5068. Is it okay for me to commit my patch?

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12403578/hadoop-4665-v4.patch
          against trunk revision 758425.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/138/console

          This message is automatically generated.

          Matei Zaharia added a comment -

          The patch actually makes sure that no job is ever brought below its fair or min share. This happens in preemptTasks, where we compute tasksToLeave and make sure we leave the job with at least that many. Since each pool's min and fair share is distributed among the jobs in that pool, this will ensure that pool shares are also kept. So the deficit thing is just a global ordering to service the jobs that have been starving the most first, but in following it, we also ensure that we never bring jobs below their shares.

          Actually, when we remove deficits from the scheduler (which I already have working code for; it's just too big a change to push that, 4665, and 4667 in the same JIRA), the logic will be simpler. We'll just look for jobs that are far below their fair share (as a ratio of the share) and preempt from jobs that are far above their fair share (again as a ratio). Then there won't be this confusion about whether we can have a service order based on deficits or not. This patch is just an attempt to get some of that code into the scheduler before pushing the big change. If you'd prefer that I post a patch for the new non-deficit-based fair scheduler instead, I can do that too.
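The tasksToLeave invariant Matei describes can be sketched as follows. This is a simplified illustration of the guarantee (never bring a job below its min or fair share), not the patch's actual preemptTasks implementation; the class and parameter names are hypothetical:

```java
// Simplified sketch of the invariant described above: when choosing
// victims, leave every job at least max(minShare, fairShare) tasks,
// so preemption can never push a job below either share.
// Names are illustrative, not from the patch.
public class PreemptSketch {

  /** How many tasks may safely be preempted from an over-scheduled job. */
  public static int tasksToPreempt(
      int runningTasks, int minShare, int fairShare, int maxToPreempt) {
    // The floor we must leave the job with.
    int tasksToLeave = Math.max(minShare, fairShare);
    // Only the surplus above that floor is preemptable.
    int preemptable = Math.max(0, runningTasks - tasksToLeave);
    // And never take more than the caller still needs.
    return Math.min(preemptable, maxToPreempt);
  }
}
```

Since each pool's min and fair share is distributed among its jobs, enforcing this per job also keeps every pool at its shares, which is the point being made in reply to Vinod's question.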

          Vinod Kumar Vavilapalli added a comment -

          I've started looking at the patch. If I understand correctly the way preemption is done, the jobs are sorted in the order of deficits, and tasks are taken away from the jobs at the bottom of the list (positive deficits) and given to the jobs with negative deficits at the top. By doing this, though we are catering to a job's fair share, it doesn't look like we are serving a queue/pool's guaranteed shares.

          So, a queue/pool can be running only its min-maps/min-reduces, but still tasks can be preempted from jobs of this queue. Is that correct? If that's the case, then that doesn't sound correct to me.

          Can you throw some light on this?

          Matei Zaharia added a comment -

          New patch that applies to trunk.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12402607/hadoop-4665-v3.patch
          against trunk revision 757958.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/128/console

          This message is automatically generated.

          Vinod Kumar Vavilapalli added a comment -

          The latest patch hadoop-4665-v3.patch is not applying cleanly to trunk, perhaps because of HADOOP-4788.

          Matei Zaharia added a comment -

          Passing patch through Hudson.

          dhruba borthakur added a comment -

          We should not put it in the 0.20 release, especially because the branch is already cut.

          It should be committed to trunk for now. If, at a later time (after making the 0.20 release), the community decides to pull it into the 0.20.1 release, then we can do that.

          Patch looks good. +1. Can you please resubmit the patch so that Hadoop QA gets to process it? Thanks.

          Matei Zaharia added a comment -

          Updated the patch to work with the current trunk.

          Matei Zaharia added a comment -

          Robert, this is not going into 0.20, is it? I thought Nigel said we should not do that.

          Matei Zaharia added a comment -

          Here is an updated patch that removes the locking of the JobTracker in update(). Instead, we only lock the JT every 30 seconds when preemptTasksIfNecessary decides that it is time to check for tasks to preempt. This should improve JT scalability.
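The 30-second guard described here amounts to a cheap time check performed before taking the JobTracker lock. A minimal sketch of that pattern (class and field names are hypothetical, not the patch's):

```java
// Sketch of checking for preemption only once per interval, so the
// expensive synchronized work against the JobTracker happens rarely
// rather than on every update() pass. Names are illustrative.
public class PreemptionTimer {
  private final long intervalMillis;
  private long lastCheckMillis;  // 0 until the first check fires

  public PreemptionTimer(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  /** Returns true (and records the check time) once per interval. */
  public synchronized boolean shouldCheckNow(long nowMillis) {
    if (nowMillis - lastCheckMillis >= intervalMillis) {
      lastCheckMillis = nowMillis;
      return true;   // caller would now lock the JT and look for victims
    }
    return false;    // skip: no JT lock taken on this pass
  }
}
```

The scalability benefit is that update() runs frequently, but heartbeat processing (also synchronized on the JobTracker) is only blocked during the rare passes where this guard fires.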

          Nigel Daley added a comment -

          We've got to stabilize 0.20. We're currently working on fixing bugs. Adding new features to contrib components at this point would not get my vote. We're already 2.5 months past 0.20 core feature freeze.

          Matei Zaharia added a comment -

          That sounds good. I'm soon going to remove that lock by the way, and ensure that we only lock the JT when there are tasks to kill.

          Amr Awadallah added a comment -

          > Does the middle ground option make sense? It may help all of us with only a little compromise, no?

          Sure, thanks for being accommodating. What is the rough ETA for when 0.20.1 would be out?

          – amr

          Hemanth Yamijala added a comment -

          Amr, that's a good point about the benefit to the community.

          One middle ground option could be to make it part of Hadoop 0.20.1, so that the community doesn't wait until 0.21 (which could be way out right now).

          One specific concern I had about this patch (from a cursory glance) was that it was locking the JobTracker (the taskTrackerManager instance) in the update method, which runs frequently. In general, this could interfere with regular processing in the jobtracker, like heartbeats (which are also synchronized on the same instance). We've in the past seen issues of this nature on large clusters. When the JT is locked up, tasktrackers could get lost and tasks could fail arbitrarily. HADOOP-5280 seems to be one specific instance of this (though we've not ascertained the reason the JT got locked up).

          Does the middle ground option make sense ? It may help all of us with only a little compromise, no ?

          Amr Awadallah added a comment -

          Hemanth,

          Facebook is not the only company that benefits from preemption; the
          community at large will benefit from this, and the sooner we get it
          out there the better. Also, the fact that Facebook has been running
          this for a number of weeks means that it will most likely not lead
          to any significant testing problems.

          Please consider fitting this into the 0.20 release, and let us know
          if there is any way we can help to expedite it.

          Thanks,

          – amr

          Hemanth Yamijala added a comment -

          Dhruba, we're planning on a release very soon based on 0.20. This would reset our testing cycle significantly. So, I would request you to move this to 0.21. Since you're already using this, I am assuming it will not impact you that much, I hope.

          dhruba borthakur added a comment -

          We have been running this code in production for a few weeks now. I would ideally have liked it to go into 0.20 (because 0.20 is not yet released), but if that is difficult then I am fine with putting it in 0.21.

          Hemanth Yamijala added a comment -

          Thanks, Matei. That would be very convenient. Dhruba, does this sound ok ? If you agree, we could set the fix in version to 0.21.

          Matei Zaharia added a comment -

          There's no reason to put it in 0.20 if 0.20 is in the testing stage right now. The issue was just opened against 0.20 because there was no 0.21 back then. It's true that the preemption is disabled by default but there's no need to complicate things.

          Hemanth Yamijala added a comment -

          Dhruba, we have been testing the FairScheduler of the Hadoop 0.20 release quite heavily in the past couple of weeks as part of an evaluation we are doing. I cannot say how much committing this to Hadoop 0.20 could affect what we've tested. Is it really necessary for the 0.20 release? Can we get it committed to trunk only instead?

          Also, I'd looked at this code cursorily. I think pre-emption is an optional feature here, right ?

          dhruba borthakur added a comment -

          +1. code looks good.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12399789/hadoop-4665-v1b.patch
          against trunk revision 742171.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3813/console

          This message is automatically generated.

          Matei Zaharia added a comment -

          Small update to change the fair scheduler log dir jobconf variable to a consistent format with the other event log related variables.

          Matei Zaharia added a comment -

          Here is an updated version of this patch. It includes some fixes from testing this at Facebook, as well as Joydeep's comments (change event log to use Log4j rolling file appender, and provide a way to set default preemption timeouts).

          Matei Zaharia added a comment -

          We can definitely take out the flush if you are okay with possibly missing the end of the event log. Right now the event log was meant only as a debugging tool (which is why it's off by default), but maybe it's more useful to make it possible to always keep it on.

          I will add a parameter for default min share preemption timeout as well.

          Joydeep Sen Sarma added a comment -

          a couple of other points (let me get back on the comments above):

          • default preemption timeout(s) for pools. right now unless we configure on a per pool basis - the default timeout is infinite.
          • eventlog - this looks very expensive - there's a flush on every log() call. can we just take out this flush()?
          Matei Zaharia added a comment -

          Those are good points, Joydeep. I will remove the fetching of conf variables every time; that was a mistake.

          About subtracting the waits: this is just a question of how we interpret the parameters. Maybe we want nodeLocalWait and rackLocalWait to be two different times that get added up for some "total wait". I originally meant for rackLocalWait to always be bigger than nodeLocalWait and thus capture the maximum delay. But since that is confusing and can lead to misconfiguration, I will make them add up as you said.

          For your third point, the idea was as follows: When it still has a lot of maps left to launch, a job will almost always have node-local tasks, so waiting is fine. However, when there are only a few maps left, there will be fewer nodes on which it can launch node-local tasks, and these may be busy running long tasks or something. So, when it's waited for nodeLocalWait amount of time, the job will start being allowed to launch rack-local tasks instead. Once it has launched such a task, it is allowed to launch more rack-local tasks rather than having to begin the waiting all over again so that it doesn't drastically slow down the rate at which it can launch tasks if the nodes with node-local data still aren't becoming free. However, we remember the locality level of the last map launched, so if the job ever does manage to launch a node-local task again, we begin the wait period again. There's a similar story for going from rack-local to off-rack: the idea is that once you've had to wait so long as to launch an off-rack task, you probably have very few opportunities left for launching rack-local or node-local tasks, so you might as well be allowed to launch more off-rack tasks and finish the job rather than having your launch rate slowed to a trickle.

          Now it's possible that just using shorter waits but requiring the wait to happen every time you need to launch a non-local task will work too. I don't know, but it seemed that in my gridmix tests the current implementation worked fine even for very small jobs (going from 2% to 75-80% node locality for 3-map tasks on a 100-node cluster).
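          The locality-wait policy described above can be sketched roughly as follows. This is an illustrative model, not the actual fair scheduler code: the class and member names (DelaySchedulingSketch, nodeLocalWait, rackLocalWait, allowedLevel) are made up, and the waits are treated as additive per Joydeep's suggestion.

```java
// Sketch of the wait-based locality policy described above (illustrative
// names; not the actual fair scheduler implementation).
enum LocalityLevel { NODE, RACK, ANY }

class DelaySchedulingSketch {
    private final long nodeLocalWait; // ms to wait before allowing rack-local launches
    private final long rackLocalWait; // additional ms before allowing off-rack launches
    private LocalityLevel lastLaunchLevel = LocalityLevel.NODE;
    private long waitStart = 0;       // when the current wait period began

    DelaySchedulingSketch(long nodeLocalWait, long rackLocalWait) {
        this.nodeLocalWait = nodeLocalWait;
        this.rackLocalWait = rackLocalWait;
    }

    // Which locality level may the job launch at right now?
    LocalityLevel allowedLevel(long now) {
        long waited = now - waitStart;
        switch (lastLaunchLevel) {
            case ANY:
                // Already fallen to off-rack: keep launching freely rather
                // than throttling the launch rate to a trickle.
                return LocalityLevel.ANY;
            case RACK:
                // Already rack-local: more rack-local launches are allowed
                // immediately; off-rack only after a further wait.
                return waited >= rackLocalWait ? LocalityLevel.ANY : LocalityLevel.RACK;
            default:
                // Waits add up, per Joydeep's suggestion.
                if (waited >= nodeLocalWait + rackLocalWait) return LocalityLevel.ANY;
                if (waited >= nodeLocalWait) return LocalityLevel.RACK;
                return LocalityLevel.NODE;
        }
    }

    // Remember the level of the last launched map; a node-local launch
    // resets the wait period, as the comment above describes.
    void recordLaunch(LocalityLevel level, long now) {
        lastLaunchLevel = level;
        waitStart = now;
    }
}
```

          The key design point is that recordLaunch remembers the last level, so a job that has already degraded to rack-local or off-rack keeps its launch rate, but a node-local launch restarts the wait.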

          Joydeep Sen Sarma added a comment -

          Dhruba and I looked at this together and got stuck on getAllowedLocalityLevel()

          • why subtract nodeLocalWait from rackLocalWait
          • why getting config variables each time
          • if we were not rack/node local last time - why don't we wait for locality next time? Seems like once we lose locality we are anyway going to run up a deficit and then schedule a boatload of non-local tasks. It seems like whether we want to wait for locality or not should be based on how much deficit we are incurring and whether it's still warranted to wait for locality (as opposed to whether we were able to schedule the last task locally)

          thoughts?

          Matei Zaharia added a comment -

          There is also one change here that is outside of the fair scheduler that I'd like feedback on. In order to allow the scheduler to kill tasks, I added a killTask method to the TaskTrackerManager interface. This means that you don't need to assume that the TaskTrackerManager is always a JobTracker, so you can write unit tests for it with a mock object. There was already a killJob in there. Is it okay to add killTask or is there a reason not to?
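          The testability argument can be illustrated with a hypothetical sketch (this is not the actual Hadoop source; the method signature and mock class shown here are made up for illustration):

```java
// Hypothetical sketch: why adding killTask to the TaskTrackerManager
// interface lets the scheduler be unit tested without a real JobTracker.
import java.util.ArrayList;
import java.util.List;

interface TaskTrackerManager {
    void killJob(String jobId);
    // The kind of method the patch proposes, so a scheduler can preempt tasks:
    void killTask(String taskAttemptId, boolean shouldFail);
}

// In production the JobTracker implements this interface; in unit tests a
// mock simply records which tasks the scheduler asked to kill.
class MockTaskTrackerManager implements TaskTrackerManager {
    final List<String> killedTasks = new ArrayList<String>();
    public void killJob(String jobId) { /* not needed for this sketch */ }
    public void killTask(String taskAttemptId, boolean shouldFail) {
        killedTasks.add(taskAttemptId);
    }
}
```

          A preemption test can then assert on killedTasks directly instead of standing up a JobTracker, which is exactly the pattern the existing killJob method already enables.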

          Matei Zaharia added a comment -

          Yikes, forgot to add, this patch also includes HADOOP-4789 because it is dependent on it. That might make it more confusing to read than it needs to be.

          Matei Zaharia added a comment -

          Here is an initial version of the patch for review. The main thing missing is unit tests.

          The patch adds two things. First there's the preemption, which works as described in the issue - jobs may preempt others if either they aren't receiving their guaranteed share for some time, or they are at below half their fair share and negative deficit for some time. The times can be configured in the fair scheduler config file and thus modified at runtime, and the guaranteed share timeouts are per pool. On top of this, to aid with debugging and development of the fair scheduler in the future, there is a scheduler event log, which is disabled by default but creates some event logs in tab-separated format in $hadoop.log.dir/fairscheduler if you turn it on. These are meant to be nitty-gritty detailed logs with machine-parsable event types rather than the "human-readable" logs that go into the standard log4j log for the JobTracker. They are also potentially much larger on a large cluster, which is why they're off by default.
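          The per-pool and last-resort timeouts Matei describes would be set in the fair scheduler's XML allocation file, along the lines of the sketch below. The element names, pool name, and values here are illustrative and should be checked against the shipped fair scheduler documentation:

```xml
<?xml version="1.0"?>
<!-- Illustrative fair scheduler allocation file; element names and
     values are examples, not authoritative. -->
<allocations>
  <pool name="production">
    <!-- Preempt if this pool is below its guaranteed share for this
         long (the per-pool timeout T1 from the issue description): -->
    <minSharePreemptionTimeout>60</minSharePreemptionTimeout>
  </pool>
  <!-- Last resort (T2): preempt if a job has been below half its fair
       share for this long; applies to all pools: -->
  <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>
```

          Because the config file is re-read at runtime, as noted above, these timeouts can be tuned without restarting the JobTracker.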

          I'm running this through hudson to see whether there are complaints from findbugs, checkstyle, etc, but I will include some unit tests in the final patch.


            People

            • Assignee: Matei Zaharia
            • Reporter: Matei Zaharia
            • Votes: 2
            • Watchers: 22
