MAPREDUCE-3859: CapacityScheduler incorrectly utilizes extra-resources of queue for high-memory jobs

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.2.1
    • Component/s: capacity-sched
    • Labels: None
    • Release Note:
      Fixed wrong CapacityScheduler resource allocation for high memory consumption jobs

      Description

      Imagine we have a queue A with a capacity of 10 slots and an extra capacity (maximum capacity) of 20. Jobs whose tasks each use 3 map slots will never consume more than 9 slots, regardless of how many free slots there are on the cluster.
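
      To make the arithmetic concrete, here is a minimal, self-contained sketch of the flawed check (a simplified model with invented names, not the actual scheduler code; the real lines are quoted in the comments below) that reproduces the 9-slot cap:

          // Simplified model of the buggy capacity check; names are illustrative.
          public class CapacityBugSketch {
              public static void main(String[] args) {
                  int queueCapacity = 10;  // configured capacity of queue A
                  int maxCapacity = 20;    // extra (maximum) capacity
                  int slotsPerTask = 3;    // each high-memory task needs 3 map slots
                  int occupied = 0;

                  while (true) {
                      // Buggy: compares occupied slots alone against the capacity,
                      // ignoring the size of the ask about to be scheduled.
                      int currentCapacity = (occupied < queueCapacity)
                              ? queueCapacity
                              : occupied + slotsPerTask;
                      currentCapacity = Math.min(currentCapacity, maxCapacity);
                      if (occupied + slotsPerTask > currentCapacity) {
                          break; // the 3-slot ask no longer fits
                      }
                      occupied += slotsPerTask;
                  }
                  // Prints 9: at occupied=9, currentCapacity stays 10 and 9+3 > 10,
                  // so the queue never reaches its 20-slot maximum capacity.
                  System.out.println("slots used: " + occupied);
              }
          }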

      Attachments

      1. test-to-fail.patch.txt (5 kB, Sergey Tryuber)
      2. MAPREDUCE-3859_MR1_fix_and_test.patch.txt (6 kB, Sergey Tryuber)

          Activity

          Mike Roark added a comment -

          Correction, still an issue in CDH4.2. The fix is the same as Sergey's comment for 4.1.2: https://issues.apache.org/jira/browse/MAPREDUCE-3859?focusedCommentId=13659278&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13659278

          Mike Roark added a comment -

          I recently tested this issue in CDH4.2 to determine if we needed to patch it. I couldn't reproduce it, so it must have gotten fixed by that version.

          Arun C Murthy added a comment -

          I'm resolving this for MR1 since I'll need to open a separate YARN jira for branch-2.

          Thanks Sergey!

          Sergey Tryuber added a comment -

          Thanks, Arun

          Arun C Murthy added a comment -

          Sergey Tryuber I've just committed this to branch-1 and branch-1.2, so we'll pick it up for hadoop-1.2.1.

          I'll also help add a test case and add this to trunk/branch-2. Thanks!

          Sergey Tryuber added a comment -

          Arun, I've attached a patch for "branch-1" with the test case and the fix (thanks for pointing me to the right branch).

          Regarding "happy to help with YARN/trunk if you want" - yes, please. I had trouble understanding the test cases of the YARN version of the CS. I'm not sure about the correctness of the testing architecture, where there is one huge capacity scheduler configuration with lots of queues. This configuration is created at the beginning of each test by a "Before" method, and every test uses it. I think this is not a good choice, because it doesn't allow testing edge cases and is hard to understand (there are no comments at all). So please, could you help me and take care of the fix for YARN?

          P.S. Hardcoded mocks are great, but, personally, I'd prefer the "old school" approach with inversion of control (the "strategy" pattern) and a flexible architecture.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12583806/MAPREDUCE-3859_MR1_fix_and_test.patch.txt
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3653//console

          This message is automatically generated.

          Sergey Tryuber added a comment -

          testcase and fix

          Sergey Tryuber added a comment -

          Fix is for MR1 only. Test + fix is in the patch.

          Arun C Murthy added a comment -

          Sergey Tryuber Please use hadoop/common/branches/branch-1 for MR1, and trunk for YARN.

          I'd appreciate it if you could provide a full patch with the fix and the test case for both branches - happy to help with YARN/trunk if you want. Thanks!

          Arun C Murthy added a comment -

          Sergey Tryuber Sorry! I lost track of this again. Sincere apologies.

          The same bug exists in CS in YARN:

              Resource currentCapacity =
                  Resources.lessThan(resourceCalculator, clusterResource, 
                      usedResources, queueCapacity) ?
                      queueCapacity : Resources.add(usedResources, required);
          

          Do you mind providing a patch for that too? Something like:

          index 64f7114..7d8800a 100644
          --- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
          +++ hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
          @@ -997,7 +997,7 @@ private Resource computeUserLimit(FiCaSchedulerApp application,
          
               Resource currentCapacity =
                   Resources.lessThan(resourceCalculator, clusterResource,
          -            usedResources, queueCapacity) ?
          +            Resources.add(usedResources, required), queueCapacity) ?
                       queueCapacity : Resources.add(usedResources, required);
          
               // Never allow a single user to take more than the
          

          Guilty as charged on mocks for testing in YARN... :)

          However, it does result in our tests running in ~20mins rather than the 5-6hrs they take in hadoop-1!
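
          Reading the ternary as plain integers may help. This is a simplified, memory-only model of the expression above (the real code goes through Resources and a ResourceCalculator), with the buggy and fixed variants side by side:

              // Memory-only model of the LeafQueue.computeUserLimit() expression.
              static int currentCapacityBuggy(int used, int required, int queueCapacity) {
                  return (used < queueCapacity) ? queueCapacity : used + required;
              }

              static int currentCapacityFixed(int used, int required, int queueCapacity) {
                  return (used + required < queueCapacity) ? queueCapacity : used + required;
              }

              // With used=9, required=3, queueCapacity=10:
              //   buggy -> 10 (the 3-slot ask cannot fit),
              //   fixed -> 12 (the capacity grows to admit the ask).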

          Sergey Tryuber added a comment -

          Mike, that's great! So I think this task can be closed, unless someone from Cloudera (their MR1 in CDH4 is still affected) wants to take care of this issue and port the fix to the old Capacity Scheduler in their sources.

          For others who face this issue, below is a brief step-by-step instruction for CDH4.1.2:

          • Download the sources from https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads. Note: you need the hadoop-0.20-mapreduce-0.20.2+1265 tarball.
          • Unpack it and go to the root directory.
          • Apply the changes from the first comment and the test case from the attached patch.
          • Also add the following lines:
                reactor.repo=https\://repository.cloudera.com/content/repositories/snapshots
                version=2.0.0-mr1-cdh4.1.2

            to the src/contrib/index/ivy/libraries.properties and src/contrib/capacity-scheduler/ivy/libraries.properties files.

          • Test the fixes that were made:
                ant test-contrib

          • Build a jar file:
                cd src/contrib/capacity-scheduler/
                ant jar
                cd -

          • The resulting file will be placed at build/contrib/capacity-scheduler/hadoop-capacity-scheduler-2.0.0-mr1-cdh4.1.2.jar.
          • Replace the original file with the fixed one on the node where the JobTracker is started. The original file is located in the /usr/lib/hadoop-0.20-mapreduce/contrib/capacity-scheduler/ directory.
          Mike Roark added a comment -

          Good call, I'd forgotten to readjust mapred.capacity-scheduler.default-user-limit-factor after I added some additional queues, so it was skewing the numbers.

          After fixing that I got exactly what I expected.

          Sergey Tryuber added a comment -

          Mike, your test results look a little strange even for 2 slots per reducer: you've said the max capacity is 60, so I would expect all 60 slots to be used in that case. Try playing with the "user limit factor". Also try setting the initial capacity to a value a little higher than 4 slots. I'm afraid there is another bug, not related to this one, when "slots per task" > "initial capacity".

          Arun, Matt, today I had a look at trunk (I believe this is what you call the "1.3.0" release, because there is no 1.3 branch), and I found a fully reworked capacity scheduler there (YARN). There is now another abstraction, called "Resource", instead of "slot/task". I dug into it for a couple of hours and got the feeling that this bug is gone there. I even found a test which tests something similar and tried to create my own test, but the test case (TestLeafQueue.java) is organized very poorly and, basically, tests nothing (mocks over mocks, no human-readable logic, and so on). Sorry, I spent a couple of hours trying to rewrite it and realized it would take me several more days, so I gave up. But, once again, the bug seems to be gone in the YARN version of the CS, so there is nothing to fix there.

          For everyone else who is affected by this bug (the old Capacity Scheduler), please use the hot fix from my first comment. Or, Arun, you could commit that fix and the attached test case (yep, the old CapacityScheduler is covered by test cases much better than YARN's) to the appropriate branch - I just don't know which branch to use, and I didn't find a "contrib" module in trunk.

          Mike Roark added a comment -

          I applied Sergey's patch from the first comment on my cdh3u5 cluster and saw some improvement for jobs with 3 slots per task, but the numbers did not change for other task sizes. Sergey, do you think I'm doing something wrong?

          Same setup as above: https://issues.apache.org/jira/browse/MAPREDUCE-3859?focusedCommentId=13609117&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13609117

          mapred.reduce.tasks | slots per reducer | expected running tasks | running tasks (no patch) | slots used (no patch) | running tasks (patch) | slots used (patch)
          30 | 1 | 30 | 30 | 30 | 30 | 30
          30 | 2 | 30 | 16 | 32 | 16 | 32
          30 | 3 | 20 |  1 |  3 | 11 | 33
          30 | 4 | 10 |  8 | 32 |  8 | 32
          30 | 5 | 10 |  8 | 40 |  8 | 40
          30 | 6 | 10 |  8 | 48 |  8 | 48
          Matt Foley added a comment -

          Changed Target Version to 1.3.0 upon release of 1.2.0. Please change to 1.2.1 if you intend to submit a fix for branch-1.2.

          Arun C Murthy added a comment -

          Sergey Tryuber Sorry, I lost track of this. I think you are right...

          A bit of history: the CS initially did not support multiple slots per task, i.e. HighRAM jobs. In that case, the check 'if (queueSlotsOccupied < queueCapacity)' was correct, since queueCapacity would be at least (queueSlotsOccupied + 1), thereby satisfying the current ask. Obviously, that breaks with multiple slots per task... and hence the bug.

          The bug exists in YARN (i.e. MR2) CapacityScheduler too - could you please provide a fix for that too?

          I'll go ahead and commit them both... thanks!

          Apologies again for losing track of this! My bad.

          Sergey Tryuber added a comment -

          What I know is that the bug is still present in CDH4.1 MR1, so we had to patch the Capacity Scheduler there as well...

          Mike Roark added a comment -

          Any updates on this bug? It is affecting us as well; it has a pretty bad effect on cluster utilization for these kinds of jobs. I tested this locally to see if it was this issue; results below:

          10 datanodes, each with 6 slots for reducers, 60 reducer slots total. No other jobs running.
          Running jobs in a queue which has "Reduce tasks, Capacity: 4 slots, Maximum capacity: 60 slots"
          Reducer sleeps for a while. This allows me to check steady state reducer slot allocation.

          mapred.reduce.tasks | slots per reducer | expected running reduce tasks | running reduce tasks | slots used | percent of capacity | notes
          30 | 1       | 30 | 30 | 30 |  750 |
          30 | 2       | 30 | 16 | 32 |  800 |
          30 | 3       | 20 |  1 |  3 |   75 | really bad
          30 | 4       | 10 |  8 | 32 |  800 |
          30 | 5       | 10 |  8 | 40 | 1000 |
          30 | 6       | 10 |  8 | 48 | 1200 |
          30 | 7 (err) |  0 |  0 |  0 |    0 | job hangs, but I expected this
          Matt Foley added a comment -

          Moved to 1.2.0 upon release of 1.1.0.

          Arun C Murthy added a comment -

          Hmm... maybe not.

          Let me think more about its implications on user-limit etc.

          Arun C Murthy added a comment (edited) -

          Thanks, I'll run the test.

          Also, maybe the fix would be better expressed as:

              int currentCapacity = max(queueSlotsOccupied, queueCapacity) + numSlotsRequested;
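
          For what it's worth, the two formulations are not identical; a small sketch (plain ints, invented helper names) shows where they diverge:

              // Sergey's comparison-based fix vs. the max-based expression above.
              static int fixViaComparison(int occupied, int requested, int capacity) {
                  return (occupied + requested <= capacity) ? capacity : occupied + requested;
              }

              static int fixViaMax(int occupied, int requested, int capacity) {
                  return Math.max(occupied, capacity) + requested;
              }

              // With requested=3 and capacity=10:
              //   occupied=5:  fixViaComparison -> 10, fixViaMax -> 13
              //   occupied=9:  fixViaComparison -> 12, fixViaMax -> 13
              //   occupied=12: fixViaComparison -> 15, fixViaMax -> 15
              // Both always leave room for the current ask; the max-based form is
              // more permissive while the queue is still under its configured capacity.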
          
          Sergey Tryuber added a comment -

          Arun, Harsh, I've submitted a patch (test-to-fail.patch.txt) with a test (it should be applied against the "branch-1.0" branch).

          The test creates a cluster with 20 map slots (5 nodes with 4 slots each) and a queue with a capacity of 10 slots, a max capacity of 16 slots, and a user limit factor of 2. The cluster is idle. When we submit a high-memory job (5 mappers with 4 slots per mapper), it should consume 16 slots (4 mappers should run), but it doesn't because of the bug.

          The test with the fix (the fix is not in the patch) works well.

          Feel free to give any remarks (about code style, for example), because this is my first Hadoop coding. Arun, I agree with you that each bug report should be "double-checked".
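
          Walking through the numbers of that scenario (my simplified reading, ignoring everything but the capacity check):

              // Buggy check: occupied grows 0 -> 4 -> 8; at occupied=8, 8 < 10 keeps
              // currentCapacity at 10, and 8+4 > 10, so only 2 mappers (8 slots) run.
              // Fixed check: 8+4 <= 10 is false, so currentCapacity becomes 8+4 = 12,
              // then 12+4 = 16; growth stops at the 16-slot maximum capacity, i.e.
              // 4 mappers (16 slots), which is what the test expects.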

          Sergey Tryuber added a comment -

          Patch with a test which fails for high-memory jobs

          Arun C Murthy added a comment -

          I'm not against fixing the CS, I just want to ensure we do the 'right' fix.

          Arun C Murthy added a comment -

          That would be great, thanks.

          Sergey Tryuber added a comment -

          We've tried playing with user-limit-factor in all sorts of variants (2, 3, 10, 100 and even more). Well, Arun, I'll just write a test for the CapacityScheduler (this test should fail). Before preparing any patches, I'll put the code here so you can take a look at it. OK?

          Arun C Murthy added a comment -

          Sergey - can you please try this again by bumping up user-limit-factor to 2 or 3?

          Sergey Tryuber added a comment -

          Arun, no, that's not a 'user-limit-factor' issue(( This option works perfectly when a job consumes only 1 slot per task, and we've been using it for a while. This bug affects only the cases where a job consumes more than one slot per task.

          Harsh, what version should I try to patch? Is it branch-1.0? Or trunk too?

          Arun C Murthy added a comment -

          Sergey, I'm pretty sure the reason you are hitting this is that you have a single user in your queue.

          By default, a single user can't exceed the queue's capacity (10 in this case). You can use 'user-limit-factor' to bump that up: http://hadoop.apache.org/common/docs/r1.0.0/capacity_scheduler.html#Configuration
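
          For reference, the per-queue setting would look roughly like this in capacity-scheduler.xml (the queue name "queueA" here is just a placeholder):

              <!-- Let a single user in queue "queueA" go up to 2x its configured capacity. -->
              <property>
                <name>mapred.capacity-scheduler.queue.queueA.user-limit-factor</name>
                <value>2</value>
              </property>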

          Harsh J added a comment -

          Sergey,

          Thanks for taking a dig at the code and coming up with a fix! Would you be interested in posting a patch for this as well? We'd require a test case that fails without the fix.

          Let us know if that's possible, thanks again!

          Harsh J added a comment -

          This affects 1.x as well, judging by the difference in commits to the CS between CDH3's CS and Apache 1.x.

          Sergey Tryuber added a comment -

          Sorry, I don't know exactly which version of Hadoop is used in the cdh3u1 distribution. There we have the following lines in CapacitySchedulerQueue.java, in the assignSlotsToJob method:

          int queueSlotsOccupied = getNumSlotsOccupied(taskType);
          int currentCapacity;
          
          if (queueSlotsOccupied < queueCapacity) {
            currentCapacity = queueCapacity;
          }
          else {
            currentCapacity = queueSlotsOccupied + numSlotsRequested;
          }
          

          Imagine we have a job with 1 slot per task. If we have a queue with a configured capacity of 10 and 9 occupied slots (and imagine we have a large maximum capacity and a lot of free slots on the cluster), then currentCapacity=10 and the task will be scheduled properly. Later, when we have 10 occupied slots, currentCapacity=11 and all will be fine too. And so on...

          Now imagine we have a job with 3 slots per task. If we have a queue with a configured capacity of 10 and 9 occupied slots, then currentCapacity=10, but that's not enough to schedule this new task! So this job will never use more than 9 slots!

          I've fixed this problem by changing:

          if (queueSlotsOccupied < queueCapacity) {
          

          on

          if (queueSlotsOccupied + numSlotsRequested <= queueCapacity) {
          

          I've rebuilt cdh3u1 from the sources, deployed the jar on the cluster, and the CapacityScheduler now works well for me.

          I've also checked out the current Hadoop trunk. Unfortunately, the sources of the CapacityScheduler have changed dramatically. But I've found similar lines in LeafQueue.java, in the computeUserLimit method:

          final int currentCapacity = 
            (consumed < queueCapacity) ? 
                queueCapacity : (consumed + required.getMemory());
          

          So, it seems to me, this bug also affects the latest CapacityScheduler.
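
          A runnable sketch contrasting the two checks (a simplified model: no max-capacity or user-limit logic, just the guard above, so the "fixed" run is bounded only by the cluster size here):

              import java.util.function.BiPredicate;

              // Simplified simulation of assignSlotsToJob's capacity guard.
              public class AssignSlotsSketch {
                  static int schedule(BiPredicate<Integer, Integer> underCapacity,
                                      int queueCapacity, int slotsPerTask, int clusterSlots) {
                      int occupied = 0;
                      while (occupied + slotsPerTask <= clusterSlots) {
                          int currentCapacity = underCapacity.test(occupied, slotsPerTask)
                                  ? queueCapacity
                                  : occupied + slotsPerTask;
                          if (occupied + slotsPerTask > currentCapacity) {
                              break; // the ask no longer fits under currentCapacity
                          }
                          occupied += slotsPerTask;
                      }
                      return occupied;
                  }

                  public static void main(String[] args) {
                      int capacity = 10, slotsPerTask = 3, clusterSlots = 30;
                      // Buggy guard: stalls at 9 slots, exactly as described above.
                      System.out.println(schedule((occ, req) -> occ < capacity,
                              capacity, slotsPerTask, clusterSlots));  // 9
                      // Fixed guard: keeps scheduling, here up to the 30-slot cluster
                      // (this sketch omits the separate maximum-capacity check).
                      System.out.println(schedule((occ, req) -> occ + req <= capacity,
                              capacity, slotsPerTask, clusterSlots));  // 30
                  }
              }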


            People

            • Assignee: Sergey Tryuber
            • Reporter: Sergey Tryuber
            • Votes: 6
            • Watchers: 10
