Hadoop Map/Reduce
MAPREDUCE-516

Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Incompatible change, Reviewed

      Description

      When a HighRAMJob turns up at the head of the queue, the current implementation of support for HighRAMJobs in the Capacity Scheduler has a problem: the scheduler stops assigning tasks to all TaskTrackers in the cluster until the HighRAMJob finds suitable TaskTrackers for all its tasks.

      This causes a severe utilization problem since effectively no new tasks are allowed to run until the HighRAMJob (at the head of the queue) gets slots.
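
      As a rough illustration, the pre-fix logic amounts to the following. This is a minimal sketch with hypothetical Job/Tracker stand-ins, not the actual CapacityTaskScheduler code:

          // Sketch of the 'cluster drain' behavior described above.
          // Job and Tracker are hypothetical stand-ins, not Hadoop classes.
          import java.util.List;

          class DrainSketch {
            static class Job { int slotsPerTask; }
            static class Tracker { int freeSlots; }

            // If the job at the head of the queue does not fit on this tracker,
            // assign nothing at all -- even though jobs further down the queue
            // could have used the free slots.
            static Job pickJob(List<Job> queue, Tracker tt) {
              if (queue.isEmpty()) {
                return null;
              }
              Job head = queue.get(0);
              if (head.slotsPerTask > tt.freeSlots) {
                return null; // every tracker is turned away until the HighRAMJob fits
              }
              return head;
            }
          }

      Under this logic every heartbeat returns empty-handed until the head-of-queue job fits somewhere, which is exactly the utilization collapse described above.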

      1. mr516-ydist.patch
        170 kB
        Sreekanth Ramakrishnan
      2. HADOOP-5964_2_20090629_yhadoop.patch
        170 kB
        Arun C Murthy
      3. HADOOP-5964_2_20090629_yhadoop.patch
        170 kB
        Arun C Murthy
      4. MAPREDUCE-516-Y20.patch
        170 kB
        Sreekanth Ramakrishnan
      5. HADOOP-5964_1_20090623_yhadoop.patch
        170 kB
        Arun C Murthy
      6. HADOOP-5964-12.patch
        168 kB
        Hemanth Yamijala
      7. HADOOP-5964_11_20090623.patch
        174 kB
        Arun C Murthy
      8. HADOOP-5964-11.patch
        165 kB
        Hemanth Yamijala
      9. HADOOP-5964_10_20090622.patch
        162 kB
        Arun C Murthy
      10. HADOOP-5964_9_20090619.patch
        163 kB
        Arun C Murthy
      11. HADOOP-5964_9_20090619.patch
        164 kB
        Arun C Murthy
      12. HADOOP-5964_8_20090618.patch
        163 kB
        Arun C Murthy
      13. HADOOP-5964_7_20090618.patch
        148 kB
        Arun C Murthy
      14. HADOOP-5964_6_20090617.patch
        145 kB
        Arun C Murthy
      15. HADOOP-5964_4_20090615.patch
        84 kB
        Arun C Murthy
      16. HADOOP-5964_2_20090609.patch
        68 kB
        Arun C Murthy
      17. HADOOP-5964_1_20090608.patch
        54 kB
        Arun C Murthy
      18. HADOOP-5964_0_20090602.patch
        31 kB
        Arun C Murthy


          Activity

          Arun C Murthy added a comment -

          A much better model is for the scheduler to pick specific TaskTrackers and reserve slots on them, while accounting for those slots against the HighRAMJob and its queue. This would mean that once slots are reserved for each task of the HighRAMJob, the other slots in the cluster can be handed out to other jobs/queues.

          Once the accounting for reserved slots is fixed, it would automatically ensure that a HighRAMJob can only reserve slots up to the quota of the queue it belongs to. Hence the next enhancement is to pick specific slots and hold them, rather than hold slots on every TaskTracker.

          Picking slots for High RAM Jobs

          The key to better support for HighRAMJobs is to reserve slots on specific TaskTrackers. Of course, one could get arbitrarily clever while picking slots; the factors to be considered are:

          • Locality of input for the specific map-task of the job
          • Minimizing the expected delay until the slot is freed on a specific TaskTracker

          For the first cut, I'd propose we consider only locality and not expected time. Once we fix speculative execution (HADOOP-2141), we will have more of the machinery necessary to predict expected times etc.; hence the pushback.
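
          To make the locality-only policy concrete, here is a minimal sketch with hypothetical MapTask/Tracker stand-ins, not scheduler code:

              // Sketch: pick a tracker to reserve based only on input locality.
              import java.util.List;

              class LocalityPick {
                static class Tracker { String host; }
                static class MapTask { List<String> splitHosts; } // hosts holding the task's input

                // Prefer a tracker that holds one of the task's input splits;
                // fall back to the first candidate if none is data-local.
                static Tracker pickTrackerToReserve(MapTask task, List<Tracker> candidates) {
                  for (Tracker tt : candidates) {
                    if (task.splitHosts.contains(tt.host)) {
                      return tt; // data-local: reserve here
                    }
                  }
                  return candidates.isEmpty() ? null : candidates.get(0);
                }
              }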

          Accounting for Reserved Slots

          It is critical that we charge the queues of the HighRAMJobs when we hold reserved slots for them, to ensure that they stay under their capacity and can't run away with slots in the cluster. The proposal is to charge jobs/queues immediately when we reserve slots on a TaskTracker (i.e. when the task can't be run immediately).
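
          A minimal sketch of that accounting rule, assuming a hypothetical per-queue slot counter rather than the scheduler's actual data structures:

              // Reserved slots are charged against the queue the moment they are
              // reserved, so a HighRAMJob cannot hold slots beyond its queue's quota.
              class QueueAccounting {
                int capacity;      // slots this queue is entitled to
                int runningSlots;  // slots occupied by running tasks
                int reservedSlots; // slots held for waiting high-RAM tasks

                // Reservation succeeds only if running + reserved stays within capacity.
                boolean tryReserve(int slots) {
                  if (runningSlots + reservedSlots + slots > capacity) {
                    return false; // over quota: do not reserve
                  }
                  reservedSlots += slots; // charge the queue immediately
                  return true;
                }
              }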

          Metering

          While metering HighRAMJobs, it would be incorrect to meter jobs (slot-hours etc.) by equating reserved slots to running slots. The proposal is to meter HighRAMJobs for open-but-held slots as well as running slots. (Open-but-held slots are those which are free on the TaskTracker but are being held while more become available for the HighRAMJob's tasks.)
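
          A sketch of how open-but-held slot time could be metered separately from running time (hypothetical bookkeeping, not the eventual JobCounter implementation):

              // Bill held-open slots for the time they sat idle waiting for the
              // HighRAMJob's task to become runnable.
              class FallowMeter {
                long fallowSlotMillis;   // accumulated open-but-held slot time
                private long reservedAt; // when the current reservation was made
                private int fallowSlots; // free slots currently held open

                void onReserve(int freeSlotsHeld, long now) {
                  fallowSlots = freeSlotsHeld;
                  reservedAt = now;
                }

                // When the task finally launches (or the reservation is dropped),
                // charge the held slots for the time they were idle.
                void onRelease(long now) {
                  fallowSlotMillis += (long) fallowSlots * (now - reservedAt);
                  fallowSlots = 0;
                }
              }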

          Notes on Implementation and Challenges

          As discussed above, the proposal is to consider just data-locality while reserving slots. Given this, there are a couple of implementation choices once we have reserved the slot:

          • Proposal 1: Hand out the task to the TaskTracker with a directive to start the task only when sufficient slots have been freed up for it.
          • Proposal 2: Hold the task at the scheduler, noting which slot (i.e. TaskTracker) has been reserved for it.
          Proposal 1

          Here we would introduce a queue of ready-to-run tasks at the TaskTracker and fill it with the tasks of HighRAMJobs.

          Pros
          • The primary advantage of taking this route is that it greatly reduces the cost of implementation; it is fairly simple to introduce a WAITING_FOR_SLOT state for the task and have the necessary information at the TaskTracker to launch it at the appropriate time (i.e. when sufficient slots are free).
          • Looking ahead, this might also be a good start towards doing more global scheduling across jobs.
          Cons
          • The major problem with this approach is that it touches a fairly sensitive part of code in the current implementation of the framework... it's fairly risky to tweak the TaskTracker code at this point, along with the JVMManager etc.
          • We would still need to tweak the JobTracker to handle the WAITING_FOR_SLOT state e.g. ensure the TaskInitializationThread doesn't kill these tasks etc.
          • We need to consider how this affects other schedulers (it probably will not).
          Proposal 2

          Here we would start marking slots as reserved (per task per job) and maintain information to assign the slot to the task when it eventually does free up.

          Pros
          • Simpler, since all state management is done centrally.
          • Less risk, since all information is maintained in the scheduler.
          Cons
          • Currently the framework isn't set up to maintain this information: we do not have a single place (e.g. a TT class in the JobTracker) to maintain per-tracker information such as reserved slots.
          • More engineering effort to maintain maps from a TaskTracker to the task it is reserved for, and vice versa.
          Recommendation
          • Proposal 1 for the attendant benefits and the leverage it gives us going forward (global scheduling etc.)
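
          For contrast, here is a minimal sketch of the central bookkeeping Proposal 2 implies (hypothetical types; a later comment in this thread does end up keeping waiting tasks at the JobTracker along these lines):

              // Bidirectional maps between a reserved tracker and the task waiting on it.
              import java.util.HashMap;
              import java.util.Map;

              class ReservationTable {
                private final Map<String, String> trackerToTask = new HashMap<String, String>();
                private final Map<String, String> taskToTracker = new HashMap<String, String>();

                void reserve(String tracker, String taskId) {
                  trackerToTask.put(tracker, taskId);
                  taskToTracker.put(taskId, tracker);
                }

                // Called when slots free up on a tracker: hand back the waiting task, if any.
                String taskReservedFor(String tracker) {
                  return trackerToTask.get(tracker);
                }

                void unreserve(String tracker) {
                  String taskId = trackerToTask.remove(tracker);
                  if (taskId != null) {
                    taskToTracker.remove(taskId);
                  }
                }
              }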

          User Interface

          It is important for users (and queue-admins) to understand that there are slots reserved for HighRAMJobs, which results in fewer running maps/reduces relative to the queue capacities. It would be nice to add reserved slots to the JobTracker/Job UI, and also to the Queue-Info in the Scheduler page.

          Arun C Murthy added a comment -

          Very early patch.

          I haven't introduced a WAITING_FOR_SLOT task state since it might be desirable for the ExpireLaunchingTasks thread to actually kill high-ram jobs which have waited too long. Thoughts?

          Arun C Murthy added a comment -

          Reasonably well tested patch for review... this patch implements queuing of high-ram tasks (at most one task is queued) at the TaskTracker. It also contains necessary fixes to the UI and introduces a new 'waiting' state for the TIP etc.

          Arun C Murthy added a comment -

          Updated patch.

          Arun C Murthy added a comment -

          After much consideration I decided to revert to the approach where we keep tasks on the JobTracker until sufficient memory is available; the problem with caching them on the TaskTracker is that it caused too many changes to the task status-reporting components, which are currently too hairy to muck with. I'm still testing the current patch.

          Arun C Murthy added a comment -

          Reasonably well tested patch, appreciate any feedback while I finish up last round of testing.

          Hemanth Yamijala added a comment -

          Arun, I've started looking at this patch. It did not apply on trunk with TestCapacityScheduler failing to merge. I tried to fix it - the conflict seemed to be only in an import statement. But when I ran the test case to check whether the merge was fine, I got the following failures:

          junit.framework.AssertionFailedError: null
          at org.apache.hadoop.mapred.TestCapacityScheduler.testUserLimitsForHighMemoryJobs(TestCapacityScheduler.java:1373)
          junit.framework.AssertionFailedError: null
          at org.apache.hadoop.mapred.TestCapacityScheduler.testClusterBlockingForLackOfMemory(TestCapacityScheduler.java:1846)
          junit.framework.AssertionFailedError: null
          at org.apache.hadoop.mapred.TestCapacityScheduler.testMemoryMatchingWithRetiredJobs(TestCapacityScheduler.java:1944)
          junit.framework.ComparisonFailure: null expected:<Used capacity: [2 (33.3]% of Capacity)> but was:<Used capacity: [4 (66.7]% of Capacity)>
          at org.apache.hadoop.mapred.TestCapacityScheduler.checkOccupiedSlots(TestCapacityScheduler.java:2814)
          at org.apache.hadoop.mapred.TestCapacityScheduler.testHighRamJobWithSpeculativeExecution(TestCapacityScheduler.java:2383)

          Particularly from the last test, I am hoping that it's only the test case that needs fixing, because taken at face value it seems like the patch has actually increased the number of used slots. :)

          I will continue to look at the changes under this assumption, and get to the test cases in a bit.

          Hemanth Yamijala added a comment -

          Also, I wanted to note that this patch is changing the TaskScheduler interface. Can we reach out explicitly to folks working on the fair scheduler and dynamic scheduler - maybe add them to the watch list of this JIRA or something?

          Arun C Murthy added a comment -

          Some bug fixes and added counters to track how long tasks are held at the Scheduler after reserving tasktrackers...

          Arun C Murthy added a comment -

          Some notes about this patch:

          1. I've introduced a new org.apache.hadoop.mapred.server.jobtracker.TaskTracker class to track all information about a given TaskTracker at the JobTracker, such as its TaskTrackerStatus, reservations for high-RAM jobs etc. I chose the new package based on the proposals at HADOOP-398.
          2. I've changed the TaskScheduler.assignTasks API to use the newly introduced class rather than the tepid TaskTrackerStatus. Other schedulers (e.g. CapacityTaskScheduler) can clearly start to take advantage of this; I've fixed them appropriately in this patch.
          3. I've had to make some classes public (JobInProgress, TaskTrackerStatus) for org.apache.hadoop.mapred.server.jobtracker.TaskTracker to work, with appropriate caveats in the javadocs etc.
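
          The shape of the interface change, with illustrative stand-in names rather than the exact trunk signatures:

              // assignTasks now receives the richer JobTracker-side TaskTracker
              // object, which carries the heartbeat status plus reservation state,
              // instead of a bare TaskTrackerStatus.
              import java.util.List;

              abstract class TaskSchedulerSketch {
                static class TaskTrackerStatus { /* heartbeat snapshot */ }
                static class JTTaskTracker {  // stand-in for the new server-side TaskTracker
                  TaskTrackerStatus status;
                  Object reservedMapJob;      // high-RAM job holding this tracker's map slots, if any
                }
                static class Task { }

                // Old shape: abstract List<Task> assignTasks(TaskTrackerStatus status);
                abstract List<Task> assignTasks(JTTaskTracker taskTracker);
              }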
          Hemanth Yamijala added a comment -

          I am looking at this patch as comprising three separate parts:

          • Changes to the scheduler for fixing the under utilization problem in the face of high RAM jobs
          • The new TaskTracker class, its lifecycle and changes in JobTracker to support this.
          • The changes on the old TaskTracker class to account for number of slots.

          I've currently done the first and partly the second part.

          Some comments so far:

          TaskTrackerStatus:

          • countOccupiedMapSlots: the check for whether a task is running, based on its status, seems complicated enough to move to an API that can be called from both countMapTasks and this API. This way, any changes to it will cause the right behavior for both APIs. Likewise for reduces.

          mapreduce.TaskTracker:

          • reserveSlots: the javadoc refers to reserving only 'map' slots.
          • Why do we need to maintain a count of reserved slots (numFallowMapSlots)? I see that the accessor API is not used anywhere.

          CapacityTaskScheduler:

          • Why are we reserving all available slots on the tasktracker? Shouldn't we always be reserving only as much as this job requires? In that case, do we need a re-reservation?
          • When we try to get a task for a job ignoring user limits (i.e. if the cluster is free), we are not reserving TTs. Is this by design? Also, is it for the same reason that we are not checking for user limits when assigning a task to a reserved TT?

          JobConf:

          • Since computeNumSlotsPerMap is used only by the CapacityScheduler right now, should we just leave this computation out of JobConf?

          JobInitializationPoller:

          • Let's not pass the scheduler instance to the poller. I think it only needs the number of map slots and reduce slots, so we can pass just that much. We've seen in the past that passing entire objects like the scheduler makes classes difficult to test. Also, not all of the information is required.

          JobTracker:

          • When a job is killed, we are not clearing the reserved trackers for this job.
          • Likewise, when a TT is blacklisted, do we need to remove its reservations?
          • It seems like the changes in the JobTracker can be reduced a little if we do not change APIs that are passed a TTStatus object or a tasktracker name. We can still change the maps to be built of TaskTracker objects, but retrieve the status wherever necessary and pass it to methods. This way the changes may be fewer and easier to verify. I think this is possible in the ExpireTrackers class, for example.

          Some nits:

          • In some places, lines run to more than 80 characters (e.g. mapreduce.TaskTracker.java).
          • There are a lot of LOG.info statements, possibly added to enable testing/debugging. Can you please remove these?
          • 'Fallow' seems a complicated word to understand. Is 'Reserved' good enough?

          Will continue with the review...

          Arun C Murthy added a comment -

          Thanks for the review, Hemanth - as you pointed out, the patch needs a bit more work to remove logging etc.

          I'm attaching a patch which incorporates your feedback.

          Some clarifications:

          TaskTrackerStatus:

          • countOccupiedMapSlots: the check for whether a task is running, based on its status, seems complicated enough to move to an API that can be called from both countMapTasks and this API. This way, any changes to it will cause the right behavior for both APIs. Likewise for reduces.

          mapreduce.TaskTracker:

          • reserveSlots: the javadoc refers to reserving only 'map' slots.
          • Why do we need to maintain a count of reserved slots (numFallowMapSlots)? I see that the accessor API is not used anywhere.

          Fixed.

          * Why are we reserving all available slots on the tasktracker? Shouldn't we always be reserving only as much as this job requires? In that case, do we need a re-reservation?

          We reserve all available slots since, by definition, they are all for the same task; we wouldn't be reserving at all if the task could run right away.
          We need 're-reservation' since the number of reserved slots (on the same tasktracker) might change over time, and we need to track these for metering (JobCounter.FALLOW_SLOTS_MILLIS_{MAPS|REDUCES}).

          * When we try to get a task for a job ignoring user limits (i.e. if the cluster is free), we are not reserving TTs. Is this by design? Also, is it for the same reason that we are not checking for user limits when assigning a task to a reserved TT?

          Yes.

          * Let's not pass the scheduler instance to the poller. I think it only needs the number of map slots and reduce slots, so we can pass just that much. We've seen in the past that passing entire objects like the scheduler makes classes difficult to test. Also, not all of the information is required.

          Done. I've added a JobInitializationPoller.JobInitializationContext and use that rather than passing the scheduler.

          JobTracker:

          • When a job is killed, we are not clearing the reserved trackers for this job.
          • Likewise, when a TT is blacklisted, do we need to remove its reservations?

          My bad. Thanks for catching this. Fixed.

          It seems like the changes in the JobTracker can be reduced a little if we do not change APIs that are passed a TTStatus object or a tasktracker name. We can still change the maps to be built of TaskTracker objects, but retrieve the status wherever necessary and pass it to methods. This way the changes may be fewer and easier to verify. I think this is possible in the ExpireTrackers class, for example.

          I really don't think it's a good idea to use both TaskTracker and TaskTrackerStatus in the long run; it's really hard to maintain. That is why I bit the bullet and changed all of them.

          Arun C Murthy added a comment -

          Forgot to add that I've managed to successfully test this patch on large clusters.

          Hemanth Yamijala added a comment -

          I've looked at most of the code changes (excluding tests and examples). Here are a few more comments:

          CapacityTaskScheduler:

          • In getTaskFromQueue, I would request a comment on why we are not reserving tasktrackers in the second pass (the reason, as we discussed offline, was that we don't think we need to give users more leeway by reserving slots when they are already over their user limit).

          JobTracker:

          • hostnameToTrackerName seems a wrong name; it should be hostnameToTracker.
          • The comment on trackerExpiryQueue refers to a TreeSet of status objects.
          • In recovery, there is currently an 'interesting' behavior: a job can be initialized both by the RecoveryManager and by a job initialization thread like the EagerTaskInitializer or the JobInitializationPoller. This means that relying on preInitializeJob to set the right number of slots may be broken.
          • Since we are not storing information about reservations across restarts, one impact could be that the counter information about how long reservations were held for a job on a tracker could be lost. This may not be a big issue, because the reservations themselves are lost on restart, but I just wanted to check what you thought.

          mapreduce.TaskTracker:

          • I am wondering if it would be good to make unreserveSlots re-entrant. I struggled a bit to determine that it will never be called twice in any scenario, which seems to be the case now. But if we can make it re-entrant by simply ignoring the operation when the reserved job is null, it might save us some corner-case bugs. Note that we are currently throwing a runtime exception.
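
          A sketch of what the suggested re-entrancy would look like (a simplified stand-in for the mapreduce.TaskTracker bookkeeping, not the actual class):

              class TrackerReservationSketch {
                private Object reservedJob; // job currently holding this tracker's slots

                void reserveSlots(Object job) {
                  reservedJob = job;
                }

                // Safe to call twice: the second call is simply a no-op instead of
                // throwing a runtime exception.
                void unreserveSlots() {
                  if (reservedJob == null) {
                    return; // already released -- ignore
                  }
                  reservedJob = null;
                }
              }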

          JobConf:

          • We are not handling the case where memory-based scheduling is disabled but the jobconf has some non-default value for the job size (say, because of user misconfiguration). computeNumSlotsPerMap should probably check the value and return 1 if scheduling is disabled; otherwise it could get set to a negative value.
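
          A sketch of the suggested guard, assuming the usual -1 sentinel for 'not set' (illustrative, not the JobConf source):

              class SlotComputationSketch {
                static final long DISABLED = -1L;

                // Fall back to one slot per map when memory-based scheduling is off
                // or misconfigured, instead of computing a negative slot count.
                static int computeNumSlotsPerMap(long jobMapMemoryMB, long slotSizeMB) {
                  if (jobMapMemoryMB == DISABLED || slotSizeMB == DISABLED || slotSizeMB <= 0) {
                    return 1;
                  }
                  return (int) Math.ceil((double) jobMapMemoryMB / slotSizeMB);
                }
              }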

          MemoryMatcher:

          • The computation of committed memory included tasks that were in the commit pending state for a reason. We'll need to check this with someone from the M/R team.
          Arun C Murthy added a comment -

          Thanks for the comments Hemanth. Here is another patch which incorporates all your comments.

          Arun C Murthy added a comment -

          Minor edit to previous version of this patch.

          Arun C Murthy added a comment -

          Updated patch, I had to fix some corner cases.

          Hemanth Yamijala added a comment -

          This is a patch that fixes failing capacity scheduler tests. Summary of changes:

          • Modified the fake objects in TestCapacityScheduler to include the correct slot information
          • Modified the display string in CapacityTaskScheduler.updateQSIObjects to what seemed more reasonable.
          • Reverted changes I had done in HADOOP-5964, as they are no longer needed with this patch. The previous code relied on jobconf to compute number of slots and had to handle cases where the job was not found (when it was retired). Since we now have this information in the TaskStatus, there's no need to handle this case.
          • Fixed testHighRamJobWithSpeculativeExecution - the previous test case wasn't entirely correct. The only catch is that the new test case does not cover a part of the code path to do with handling speculative reduces. I suppose that can be done in a follow-up JIRA.

          With these changes, TestCapacityScheduler passes on my local machine.

          Arun C Murthy added a comment -

          Updated and hopefully final patch. Passes test cases and 'ant test-patch':

               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 30 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          
          
          Hemanth Yamijala added a comment -

          I looked at the last patch. It seems fine to me, except for one small problem I'd commented on earlier: in the JobConf.compute* methods, we should check whether any of the memory parameters are not defined and then return 1; otherwise, we could end up computing negative values. The updated patch has only this one change.

          Since currently this API is used only by CapacityTaskScheduler, I ran the relevant test cases in CS, and they passed.

          Results of test-patch output:

               [exec] -1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 30 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     -1 Eclipse classpath. The patch causes the Eclipse classpath to differ from the contents of the lib directories.
               [exec]
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          

          I checked with Giri about the -1 on the Eclipse classpath, and he told me that it can be ignored, because the check doesn't run very well with Ivy.

          Based on this, I will commit the patch.

          Arun C Murthy added a comment -

          Hemanth, the JobConf.compute* methods do not have access to the memory parameters since they are in the JobTracker/Scheduler's conf?

          Arun C Murthy added a comment -

          Never mind - your changes seem fine.

          The only concern is that the patch you just uploaded seems much smaller (168K) than the one I uploaded last night (174K). Can you please check?

          Hemanth Yamijala added a comment -

          Arun and I checked the difference in file sizes. The difference is that Arun was using git and I was using SVN, and it seems SVN-generated patches are smaller. The number of modified files in the patch also matches, so I think it is OK.

          Hemanth Yamijala added a comment -

          I just committed this. Thanks, Arun!

          Arun C Murthy added a comment -

          Patch for yahoo 0.20 branch.

          Sreekanth Ramakrishnan added a comment -

          Attaching the latest patch for internal Y! 20, with modifications to JobConf as per Hemanth's latest patch.

          Also, I ran TestCapacityScheduler to verify that the tests pass.

          Arun C Murthy added a comment -

          Updated to reflect changes to yhadoop-0.20.

          Arun C Murthy added a comment -

          Forgot to do "--no-prefix". Fixed now.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/)

          Sreekanth Ramakrishnan added a comment -

          Attaching Yahoo! distribution patch.


    People

    • Assignee: Arun C Murthy
    • Reporter: Arun C Murthy
    • Votes: 0
    • Watchers: 10
