
Key: HADOOP-4830
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Vinod K V
Reporter: Vinod K V
Votes: 0
Watchers: 4
Hadoop Common

Have end to end tests based on MiniMRCluster to verify that queue capacities are honoured.

Created: 11/Dec/08 05:27 AM   Updated: 08/Jul/09 04:40 PM
Component/s: None
Affects Version/s: None
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  HADOOP-4830-20081222-svn.2 (2008-12-22 01:02 PM, Vinod K V, 47 kB)
  HADOOP-4830-20081229-svn.txt (2008-12-29 05:36 AM, Vinod K V, 50 kB)
  HADOOP-4830-20090106-2-svn.txt (2009-01-06 10:52 AM, Vinod K V, 51 kB)
Issue Links:
Blocker
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 07/Jan/09 05:59 AM


Description
At present, we only have unit tests that use FakeTaskManager, and these test the capacity scheduler only in isolation. Many recently unearthed issues have shown that this is not enough: end-to-end tests are required so that a real JobTracker (JT) is brought into the picture, and with it the interaction of the scheduler with the JT. This issue, along with a few other related JIRAs, should use MiniMRCluster to automate and replace the end-to-end tests that QA now runs manually.

Vinod K V added a comment - 11/Dec/08 05:35 AM
This particular JIRA should address the following:
  • Write/improve utilities that will help the test-cases. This includes
    • enhancing org.apache.hadoop.mapred.UtilsForTests.Waiting{Mapper|Reducer} so that the set of tasks allowed to finish on a signal can be controlled and can be a subset of all the tasks.
    • having a utility that makes a test block until a particular number of tasks are running and occupying the cluster's slots.
    • setting up a way to configure the capacity scheduler for the tests.
  • Automate the tests related to queue capacities: submit jobs to different queues simultaneously and see that capacities are honored
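The capacity check described in the last point can be sketched in plain Java, independent of Hadoop. This is only an illustration: the class and method names are hypothetical, and the floor rounding of a queue's share is an assumption that may not match how the capacity scheduler actually rounds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the patch: checks that each queue is
// running at least its configured share of the cluster's slots.
final class QueueCapacitySketch {

    // Assumption: a queue's guaranteed share is floor(totalSlots * percent / 100).
    static int guaranteedSlots(int totalSlots, int capacityPercent) {
        return (totalSlots * capacityPercent) / 100;
    }

    // Returns true if every queue runs at least its guaranteed number of tasks.
    // Only meaningful when every queue has enough pending work to fill its share.
    static boolean capacitiesHonoured(int totalSlots,
                                      Map<String, Integer> capacityPercent,
                                      Map<String, Integer> runningTasks) {
        for (Map.Entry<String, Integer> e : capacityPercent.entrySet()) {
            int guaranteed = guaranteedSlots(totalSlots, e.getValue());
            int running = runningTasks.getOrDefault(e.getKey(), 0);
            if (running < guaranteed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> capacity = new LinkedHashMap<>();
        capacity.put("default", 50);
        capacity.put("q2", 50);

        Map<String, Integer> running = new LinkedHashMap<>();
        running.put("default", 2);
        running.put("q2", 2);

        System.out.println(capacitiesHonoured(4, capacity, running));
    }
}
```

In a real test the running counts would come from the cluster rather than from a hand-built map.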

Vinod K V made changes - 11/Dec/08 05:35 AM
Field Original Value New Value
Assignee Vinod K V [ vinodkv ]
Vinod K V made changes - 11/Dec/08 05:40 AM
Link This issue relates to HADOOP-4831 [ HADOOP-4831 ]
Vinod K V made changes - 11/Dec/08 05:45 AM
Link This issue relates to HADOOP-4832 [ HADOOP-4832 ]
Vinod K V made changes - 11/Dec/08 05:52 AM
Link This issue relates to HADOOP-4833 [ HADOOP-4833 ]
Vinod K V made changes - 11/Dec/08 05:56 AM
Link This issue relates to HADOOP-4834 [ HADOOP-4834 ]
Vinod K V made changes - 11/Dec/08 06:01 AM
Link This issue is blocked by HADOOP-4784 [ HADOOP-4784 ]
Vinod K V added a comment - 22/Dec/08 01:02 PM
Attaching patch. This includes
  • o.a.h.mapred.ControlledMapReduceJob to run a job whose tasks' execution can be precisely controlled by a user.
  • o.a.h.mapred.ClusterWithCapacityScheduler to start a MiniMRCluster with CapacityScheduler configured as the scheduler. It provides an API for configuring both the cluster and the scheduler. Any test that verifies the behaviour of CapacityTaskScheduler should extend this class.
  • Tests o.a.h.m.TestControlledMapReduceJob and o.a.h.m.TestClusterWithCapacityScheduler to test both of the above utility classes. The latter is very minimal and can be improved as we go.
  • o.a.h.m.TestQueueCapacities to automate the tests related to queue capacities and to serve as a demonstration of how to use the above utilities for testing CapacityTaskScheduler. This includes basic queue capacity tests. Many other tests are possible but some of them involve testing reclamation of slots and hence will be done in a separate Jira (HADOOP-4834).
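One way to picture controlled task execution is with per-task signal files: a waiting task finishes only once its own file exists, so the test decides which subset of tasks may complete. The sketch below is a plain-Java illustration under that assumption. The class name, the method names, and the "signal-N" file naming are hypothetical, and it uses the local file system rather than Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of signal-file based control; not the patch's code.
final class SignalFileSketch {

    // Creates a fresh directory to hold the signal files.
    static Path newSignalDir() {
        try {
            return Files.createTempDirectory("signalFileDir-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Assumed naming scheme: one file per task number.
    static Path signalFile(Path signalDir, int taskNumber) {
        return signalDir.resolve("signal-" + taskNumber);
    }

    // Controller side: allow tasks in [from, to) to finish.
    static void signalTasks(Path signalDir, int from, int to) {
        try {
            for (int i = from; i < to; i++) {
                Path f = signalFile(signalDir, i);
                if (!Files.exists(f)) {
                    Files.createFile(f);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Task side: a waiting task would poll this and finish once it is true.
    static boolean isSignalled(Path signalDir, int taskNumber) {
        return Files.exists(signalFile(signalDir, taskNumber));
    }

    public static void main(String[] args) {
        Path dir = newSignalDir();
        signalTasks(dir, 0, 3);
        System.out.println(isSignalled(dir, 1));
        System.out.println(isSignalled(dir, 5));
    }
}
```

Signalling a range of task numbers keeps the remaining tasks occupying their slots, which is what a capacity test needs in order to observe steady-state slot usage.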

Vinod K V made changes - 22/Dec/08 01:02 PM
Attachment HADOOP-4830-20081222-svn.2 [ 12396592 ]
Vinod K V added a comment - 22/Dec/08 01:04 PM
Amar, can you please review this patch?

Thanks,
-Vinod


Vinod K V made changes - 22/Dec/08 01:04 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Hemanth Yamijala added a comment - 24/Dec/08 11:25 AM
Some comments:

ControlledMapReduceJob:

  • All paths should be created relative to the build directory. Something like new Path(System.getProperty("test.build.data","/tmp"), "signalFileDir-...")
  • Do we really need to create the temp file? Is it only for creating a unique random number? Can we use Random for the same?
  • Rather than split and get the task id, we can use the TaskID classes to get the same. The hierarchy extends to ID, which will give you the number. Also, rather than call it TaskID, which has a specific meaning, can we call it taskNumber or something?
  • getTasksCounts: can't we directly use finishedMaps or finishedReduces?
  • assertNTasksRunningAtSteadyState: The 5 seconds time limit brings in timing dependencies that should be avoided if we can. Ideally if we can check that two consecutive heartbeat cycles don't change the running counts, that should be enough. Can we check the state of the JT or the scheduler to get this information ?
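The first two points can be illustrated together with a small sketch that needs only the JDK. It resolves the directory under the test.build.data system property (falling back to /tmp, as suggested above) and uses Random for the unique suffix. The class and method names are hypothetical, and java.io.File stands in for Hadoop's Path so that the example is self-contained.

```java
import java.io.File;
import java.util.Random;

// Hypothetical sketch; in the Hadoop test this would build an
// org.apache.hadoop.fs.Path instead of a java.io.File.
final class TestPathSketch {

    // Builds a uniquely named directory under the test build directory.
    static File signalFileDir(Random random) {
        String base = System.getProperty("test.build.data", "/tmp");
        // A random non-negative suffix replaces the temp-file trick.
        return new File(base, "signalFileDir-" + random.nextInt(Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        System.out.println(signalFileDir(new Random()));
    }
}
```

A random suffix does not strictly guarantee uniqueness, so a test that cares about collisions would still need to check whether the directory already exists.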

ClusterWithCapacityScheduler:

  • Write the capacity scheduler configuration in a path relative to test.build.data
  • The default values for the job initialization related properties are fixed. So this can be removed now.
  • Please review the Log level of the log statements. I think some of them will be too verbose for an INFO level. For e.g. what keys we're writing to the scheduler conf.
  • Please have a more clear comment on why fs.getRawFileSystem().setConf(config); is required after setting it on the local file system object.

TestClusterWithCapacityScheduler doesn't seem specifically needed. A lot of the tests will exercise this and it will be very obvious if it doesn't work. Unlike the TestControlledMapReduceJob which is a simple test that can be easily verified for correctness with the default scheduler.

TestQueueCapacities:

  • Since we're trying to use MiniMR, can we by default have more than 1 tasktracker - like 4 or something, and suitably scale all the task counts. Likewise, I also think number of reduces should be non-zero for most cases. I think this would make it closer to reality. And since we're using controlled execution, it would really not make a difference to the test logic, right ?
  • I think we fixed reclaimcapacity time limit to be in seconds, rather than milliseconds.
  • Related to multiple queue tests, can we also have a test where jobs are submitted to different queues - all below the queue's capacity, and make sure they are all running. This will exercise some specific code paths related to job initialization, considering multiple queues for scheduling jobs etc.

Vinod K V added a comment - 29/Dec/08 05:36 AM
Attaching a new patch incorporating the review comments. Notes on some particular points follow.

assertNTasksRunningAtSteadyState: The 5 seconds time limit brings in timing dependencies that should be avoided if we can. Ideally if we can check that two consecutive heartbeat cycles don't change the running counts, that should be enough. Can we check the state of the JT or the scheduler to get this information ?

This is replaced with a ClusterWithCapacityScheduler.WaitTillAllTasksAreOccupied to test that all the slots of a particular type are occupied in the cluster. This is done by looking at the ClusterStatus and waiting until the total number of tasks running becomes equal to the maximum number of slots in the cluster.
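The polling idea can be sketched without Hadoop as a loop that compares a running-task count against the slot maximum. This is not the patch's code: the class name, method name, and parameters are hypothetical, and the timeout is an added safeguard. In the real test the count would presumably be read from ClusterStatus on each iteration.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of "wait until all slots are occupied".
final class SlotWaitSketch {

    // Polls runningTasks until it reaches maxSlots or the timeout expires.
    // Returns true if the slots filled up, false on timeout or interruption.
    static boolean waitTillAllSlotsAreOccupied(IntSupplier runningTasks,
                                               int maxSlots,
                                               long pollMillis,
                                               long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (runningTasks.getAsInt() < maxSlots) {
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in supplier: pretends one more task starts on every poll.
        int[] running = {0};
        boolean full = waitTillAllSlotsAreOccupied(() -> ++running[0], 4, 10, 5000);
        System.out.println(full);
    }
}
```

Waiting on an observable cluster state instead of a fixed sleep removes the 5-second timing dependency raised in the review, while the timeout keeps a stuck cluster from hanging the test forever.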

TestClusterWithCapacityScheduler doesn't seem specifically needed. A lot of the tests will exercise this and it will be very obvious if it doesn't work. Unlike the TestControlledMapReduceJob which is a simple test that can be easily verified for correctness with the default scheduler.

The original intention was to test invalid configuration. Moved this out; it may be done later. Alternatively, if possible, we should move the checks for invalid configuration into CapacitySchedulerConf itself instead of having them in CapacityTaskScheduler.start().

Others:

  • Added ClusterWithCapacityScheduler.cleanUpSchedulerConfigFile.
  • Test time:
    TestQueueCapacities is taking an average of slightly more than 6 1/2 minutes for each run, excluding the build time. This is after a bit of refactoring is done to reuse clusters across multiple tests instead of starting a new cluster for every single test. I've tried to minimize this time as far as possible, but an independent effort should be taken up to reduce this test time.

Vinod K V made changes - 29/Dec/08 05:36 AM
Attachment HADOOP-4830-20081229-svn.txt [ 12396822 ]
Vinod K V made changes - 31/Dec/08 12:19 PM
Link This issue blocks HADOOP-4831 [ HADOOP-4831 ]
Hemanth Yamijala added a comment - 05/Jan/09 08:35 AM
Looking good. A few comments:
  • We are iterating over the task list to get the number of running tasks in ControlledMapReduceJob.getRunningTasksCount(). We check if the task is running using TaskInProgress.isRunning(). This method of computation seems like it would not be different from JobInProgress.runningMaps() or JobInProgress.runningReduces(). Can you please check if there is a difference ?
  • There are a few TODOs in the comments in ControlledMapReduceJob. These should be removed, or alternatively we should add a more descriptive comment about a possible improvement.
  • In writeFile, some lines related to replication are commented. These can be removed.
  • The RunningJob variable, rJob, can be private and not package private

When I ran the test on my system, it failed because some Jetty-related classes were missing:

org/mortbay/jetty/servlet/Context
java.lang.NoClassDefFoundError: org/mortbay/jetty/servlet/Context
    at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:955)
    at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:275)
    at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:119)
    at org.apache.hadoop.mapred.ClusterWithCapacityScheduler.startCluster(ClusterWithCapacityScheduler.java:101)
    at org.apache.hadoop.mapred.TestQueueCapacities.testSingleQueue(TestQueueCapacities.java:54)

This may be that we need to fix something in ivy.xml of the capacity scheduler. Copying the ivy.xml from streaming ran the test successfully.


Hemanth Yamijala added a comment - 05/Jan/09 08:36 AM
Also, in the next version, can you please update test-patch results, so I can commit the patch ?

Vinod K V added a comment - 06/Jan/09 10:52 AM
Attaching a new patch. Incorporated the above review comments. Notes about particular points follow:

We are iterating over the task list to get the number of running tasks in ControlledMapReduceJob.getRunningTasksCount(). We check if the task is running using TaskInProgress.isRunning(). This method of computation seems like it would not be different from JobInProgress.runningMaps() or JobInProgress.runningReduces(). Can you please check if there is a difference ?

As pointed out, there is no real difference. The original intention was to ensure that the given number of tasks are really running on TaskTrackers. But in retrospect, I concluded that the number of tasks scheduled (JIP.runningTasks()) should suffice, because in the test environment (no lost trackers) the number of tasks scheduled is the same as the number of tasks running. Changed the code to use JIP.runningTasks() accordingly.

This may be that we need to fix something in ivy.xml of the capacity scheduler. Copying the ivy.xml from streaming ran the test successfully.

Made changes to ivy.xml in the capacity scheduler source to include the needed jars. But, as Hemanth agreed during a discussion, it may become cumbersome in future to add every new jar that the capacity scheduler does not need directly but that must still be included because of its dependency on underlying projects/modules such as mapred/hdfs. Will file a new issue to see if this can be addressed generally.

While running the tests, I found that some of them were timing out. The actual reason turned out to be HADOOP-4977. Until that gets fixed, TestQueueCapacities might fail occasionally because of it.

ant test-patch results:

     [exec] +1 overall.
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 15 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

Vinod K V made changes - 06/Jan/09 10:52 AM
Attachment HADOOP-4830-20090106-2-svn.txt [ 12397190 ]
Repository Revision Date User Message
ASF #732231 Wed Jan 07 05:51:03 UTC 2009 yhemanth HADOOP-4830. Add end-to-end test cases for testing queue capacities. Contributed by Vinod Kumar Vavilapalli.
Files Changed
ADD /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/TestControlledMapReduceJob.java
MODIFY /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/ClusterMapReduceTestCase.java
ADD /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/ControlledMapReduceJob.java
ADD /hadoop/core/trunk/src/contrib/capacity-scheduler/src/test/org/apache/hadoop/mapred/TestQueueCapacities.java
MODIFY /hadoop/core/trunk/src/contrib/capacity-scheduler/src/java/org/apache/hadoop/mapred/CapacitySchedulerConf.java
MODIFY /hadoop/core/trunk/CHANGES.txt
ADD /hadoop/core/trunk/src/contrib/capacity-scheduler/src/test/org/apache/hadoop/mapred/ClusterWithCapacityScheduler.java
MODIFY /hadoop/core/trunk/src/contrib/capacity-scheduler/ivy.xml

Repository Revision Date User Message
ASF #732233 Wed Jan 07 05:58:27 UTC 2009 yhemanth Merge -r 732230:732231 from trunk to branch 0.20 to fix HADOOP-4830.
Files Changed
MODIFY /hadoop/core/branches/branch-0.20/src/test/org/apache/hadoop/mapred/ClusterMapReduceTestCase.java
MODIFY /hadoop/core/branches/branch-0.20/src/contrib/capacity-scheduler/ivy.xml
ADD /hadoop/core/branches/branch-0.20/src/test/org/apache/hadoop/mapred/ControlledMapReduceJob.java (from /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/ControlledMapReduceJob.java)
ADD /hadoop/core/branches/branch-0.20/src/contrib/capacity-scheduler/src/test/org/apache/hadoop/mapred/TestQueueCapacities.java (from /hadoop/core/trunk/src/contrib/capacity-scheduler/src/test/org/apache/hadoop/mapred/TestQueueCapacities.java)
MODIFY /hadoop/core/branches/branch-0.20/src/contrib/capacity-scheduler/src/java/org/apache/hadoop/mapred/CapacitySchedulerConf.java
MODIFY /hadoop/core/branches/branch-0.20/CHANGES.txt
ADD /hadoop/core/branches/branch-0.20/src/contrib/capacity-scheduler/src/test/org/apache/hadoop/mapred/ClusterWithCapacityScheduler.java (from /hadoop/core/trunk/src/contrib/capacity-scheduler/src/test/org/apache/hadoop/mapred/ClusterWithCapacityScheduler.java)
ADD /hadoop/core/branches/branch-0.20/src/test/org/apache/hadoop/mapred/TestControlledMapReduceJob.java (from /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/TestControlledMapReduceJob.java)

Hemanth Yamijala added a comment - 07/Jan/09 05:59 AM
I just committed this to trunk and branch 0.20, since these are test cases and involve no change to functionality. Thanks, Vinod!

Hemanth Yamijala made changes - 07/Jan/09 05:59 AM
Resolution Fixed [ 1 ]
Fix Version/s 0.20.0 [ 12313438 ]
Hadoop Flags [Reviewed]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Nigel Daley made changes - 23/Apr/09 07:17 PM
Status Resolved [ 5 ] Closed [ 6 ]
Owen O'Malley made changes - 08/Jul/09 04:40 PM
Component/s contrib/capacity-sched [ 12312466 ]