Hadoop Map/Reduce
MAPREDUCE-522

Rewrite TestQueueCapacities to make it simpler and avoid timeout errors

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: capacity-sched
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      We have seen TestQueueCapacities fail periodically, and a couple of fixes have only partially addressed the problem, the most recent being HADOOP-5869. I found another instance of the failure while running tests locally for a different patch, with a different symptom from the ones we've seen before. The core problem is that the test is too complex and relies on too many things working correctly to be useful. It would make sense to revisit the purpose of the test and see whether a simpler model can serve it.

      Attachments

      1. mapred-522-ydist.patch
        27 kB
        Sreekanth Ramakrishnan
      2. mapreduce-522-4.patch
        27 kB
        Sreekanth Ramakrishnan
      3. mapreduce-522-3.patch
        27 kB
        Sreekanth Ramakrishnan
      4. mapreduce-522-2.patch
        27 kB
        Sreekanth Ramakrishnan


          Activity

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #15 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/15/)

          Hemanth Yamijala added a comment -

          I just committed this. Thanks, Sreekanth!

          Sreekanth Ramakrishnan added a comment -

          All capacity scheduler tests pass with the patch.

          Hemanth Yamijala added a comment -

          Looks fine to me. +1 for the patch.

          Sreekanth Ramakrishnan added a comment -

          Output from ant test-patch:

               [exec]
               [exec] +1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 12 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
               [exec]
               [exec]
          
          Sreekanth Ramakrishnan added a comment -

          Patch for the Yahoo! distribution.

          Sreekanth Ramakrishnan added a comment -

          Attaching a new patch incorporating Hemanth's offline comment: the test jobs now include setup and cleanup tasks.

          Sreekanth Ramakrishnan added a comment -

          Attaching the latest patch, incorporating the following comments from Hemanth:

          • Correct the uploaded patch.
          • Remove all while-job-complete wait loops in the test cases.
          Sreekanth Ramakrishnan added a comment -

          Attaching a patch which does the following:

          1. Deleted the TestQueueCapacities.
          2. Renamed TestJobInitialization to TestCapacitySchedulerWithJobTracker.
          3. Added a new test case to TestCapacitySchedulerWithJobTracker which checks JobTracker integration.
          4. Modified build-contrib.xml to point to the correct location of the examples classes.

          Currently, the JSP libraries are not on the runtime classpath of JUnit, so they have to be added to the path manually; otherwise reduce fetches fail. The JIRA tracking that issue is MAPREDUCE-694.

          Sreekanth Ramakrishnan added a comment -

          The JSP libraries need to be fixed for the reduces of jobs to work.

          Hemanth Yamijala added a comment -

          It would be good to determine a goal for this test case. I think many of the scenarios mentioned above (queue capacity expansion, new tasks starting as a previous wave finishes) are already adequately covered by the test cases in TestCapacityScheduler.

          I think the original intent of this test case was primarily to define a smoke test that integrates the capacity scheduler with the M/R framework - particularly the jobtracker - and makes sure that submitted jobs get scheduled and run. Whether the right scheduling decisions are taken is a test of the capacity scheduler itself, and that is already done (or can be done if missing) in TestCapacityScheduler.

          So, what if we simply have one test case that runs small jobs requiring less than, equal to, and more than the capacity of two queues (50% each), and just make sure they all run and complete successfully? Would that be simple enough to work while retaining the original goal?

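          A minimal sketch of the smoke test proposed above, assuming the old
          org.apache.hadoop.mapred API and a harness that hands the test a JobConf
          already wired to a capacity-scheduler cluster with two queues q1 and q2 at
          50% capacity each; the queue names, job sizes, and the runJobInQueue helper
          are illustrative assumptions, not part of any attached patch.

              import java.io.IOException;

              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.mapred.FileInputFormat;
              import org.apache.hadoop.mapred.FileOutputFormat;
              import org.apache.hadoop.mapred.JobClient;
              import org.apache.hadoop.mapred.JobConf;

              public class CapacitySchedulerSmokeTest {

                // Hypothetical helper: submit a trivial (identity) job with the given
                // number of maps to a queue and block until it finishes. runJob()
                // throws IOException if the job fails, so no wait loop is needed.
                static void runJobInQueue(JobConf clusterConf, String queue, int numMaps)
                    throws IOException {
                  JobConf job = new JobConf(clusterConf);
                  job.setQueueName(queue);
                  job.setNumMapTasks(numMaps);
                  job.setNumReduceTasks(1);
                  // Placeholder paths; a real test would generate small inputs.
                  FileInputFormat.setInputPaths(job, new Path("smoke-in"));
                  FileOutputFormat.setOutputPath(job, new Path("smoke-out-" + queue + "-" + numMaps));
                  JobClient.runJob(job);
                }

                // One job under, one at, and one beyond queue capacity (assuming four
                // map slots per queue). The only assertion is that each job completes;
                // the scheduling decisions themselves are covered by TestCapacityScheduler.
                public void testJobsRunToCompletion(JobConf clusterConf) throws IOException {
                  runJobInQueue(clusterConf, "q1", 2);  // less than queue capacity
                  runJobInQueue(clusterConf, "q1", 4);  // equal to queue capacity
                  runJobInQueue(clusterConf, "q2", 8);  // more than queue capacity
                }
              }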
          Sreekanth Ramakrishnan added a comment -

          The following are the tests executed by org.apache.hadoop.mapred.TestQueueCapacities:

          • Test job submission with a single queue configured. The single queue has 100% of cluster capacity. Submit a job which requires more than 100% of capacity and ensure that it completes successfully.
          • Test multiple job submissions with a single queue configured. The single queue has 100% of cluster capacity. Submit a job j1 which requires 100% of queue capacity. Wait till all slots are occupied by j1 and then submit j2. Observe that all tasks from j1 complete before tasks from j2 are attended to.
          • Test multiple job submissions with a single queue configured. The single queue has 100% of cluster capacity. Submit a job j1 which requires less than the queue capacity. Wait till all of j1's tasks start running and then submit job j2. Observe that j2's tasks get scheduled. Let both jobs run to completion.
          • Test a single job with multiple queues configured. Submit a single job which requires 100% of cluster capacity. Observe that the job fully occupies the cluster, i.e. queue capacity expands to satisfy the job.
          • Test multiple jobs with multiple queues configured. Submit as many jobs as there are queues configured, each requiring exactly the capacity of its queue. Observe that all the jobs start running at the same time.

          Can we just combine the five tests above into the single test below, to improve run time and make it simpler? (A sketch follows this list.)

          • Configure the Capacity Scheduler with two queues, each getting 50% of cluster capacity.
          • Submit a job j1 to q1 to occupy 100% of cluster capacity.
          • Wait till job j1 occupies 100% of cluster capacity, then submit j2 to q2.
          • Observe that as j1's tasks start finishing, tasks from j2 are scheduled, expanding capacity for q2.
          • Submit a job j3 in q1 and observe that once q1's capacity has contracted to its 50% share and j1's tasks are finishing, j3 is scheduled.
          • Observe that all the tasks from all jobs complete fully.
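          A sketch of how the combined test above might look, assuming the old
          org.apache.hadoop.mapred API and a harness that supplies a JobClient plus a
          JobConf for a two-queue capacity-scheduler cluster with totalSlots map slots;
          submitJob, waitForRunningMaps, assertSuccess, and the I/O paths are
          hypothetical stand-ins for whatever job-control utility the test ends up using.

              import java.io.IOException;

              import org.apache.hadoop.fs.Path;
              import org.apache.hadoop.mapred.FileInputFormat;
              import org.apache.hadoop.mapred.FileOutputFormat;
              import org.apache.hadoop.mapred.JobClient;
              import org.apache.hadoop.mapred.JobConf;
              import org.apache.hadoop.mapred.RunningJob;

              public class CombinedQueueCapacityTest {

                // Hypothetical helper: non-blocking submission of a trivial map-only job.
                static RunningJob submitJob(JobClient client, JobConf clusterConf,
                    String queue, int numMaps) throws IOException {
                  JobConf job = new JobConf(clusterConf);
                  job.setQueueName(queue);
                  job.setNumMapTasks(numMaps);
                  job.setNumReduceTasks(0);
                  FileInputFormat.setInputPaths(job, new Path("in"));
                  FileOutputFormat.setOutputPath(job, new Path("out-" + queue + "-" + numMaps));
                  return client.submitJob(job);
                }

                // Poll the cluster until the expected number of maps is running.
                static void waitForRunningMaps(JobClient client, int expected)
                    throws IOException, InterruptedException {
                  while (client.getClusterStatus().getMapTasks() < expected) {
                    Thread.sleep(1000);
                  }
                }

                static void assertSuccess(RunningJob job) throws IOException {
                  job.waitForCompletion();
                  if (!job.isSuccessful()) {
                    throw new IOException("Job failed: " + job.getID());
                  }
                }

                public void testExpansionAndContraction(JobClient client,
                    JobConf clusterConf, int totalSlots) throws Exception {
                  // j1 in q1 asks for 100% of the cluster; with q2 idle, q1's
                  // capacity should expand to absorb it.
                  RunningJob j1 = submitJob(client, clusterConf, "q1", totalSlots);
                  waitForRunningMaps(client, totalSlots);

                  // As j1's maps finish, freed slots should be scheduled to j2 in q2,
                  // restoring q2's 50% share.
                  RunningJob j2 = submitJob(client, clusterConf, "q2", totalSlots / 2);

                  // j3 in q1 is scheduled once q1 contracts back to its 50% share
                  // and j1's maps keep finishing.
                  RunningJob j3 = submitJob(client, clusterConf, "q1", totalSlots / 2);

                  // Finally, everything must run to completion.
                  assertSuccess(j1);
                  assertSuccess(j2);
                  assertSuccess(j3);
                }
              }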
          Hemanth Yamijala added a comment -

          Just for information, the failure this time around happened as follows:

          • The test timed out in multipleQsWithOneQBeyondCapacity, while waiting for 5 map tasks to complete.
           • The check for completion of tasks in ControlledMapReduceJob assumes all map tasks run successfully. Note that the check is on jip.finishedMaps(), which does not count failed tasks.
          • However, one of the map tasks failed this time, with the following stack trace:
                [junit] 09/06/17 12:49:20 INFO mapred.TaskInProgress: Error from attempt_200906171248_0001_m_000003_0: java.io.FileNotFoundException: File signalFileDir-7646601804912829477/MAPS_0 does not exist.
                [junit]   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
                [junit]   at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:301)
                [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
                [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:771)
                [junit]   at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:465)
                [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:746)
                [junit]   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:806)
                [junit]   at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:936)
                [junit]   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:891)
                [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.listSignalFiles(ControlledMapReduceJob.java:278)
                [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:318)
                [junit]   at org.apache.hadoop.mapred.ControlledMapReduceJob.map(ControlledMapReduceJob.java:60)
                [junit]   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
                [junit]   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
                [junit]   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
                [junit]   at org.apache.hadoop.mapred.Child.main(Child.java:159)
            
          • This, in turn, seems to relate to the problem described in HADOOP-4167. The mappers all list contents of a filesystem looking for 'signal' files. These signal files are renamed and therefore go missing asynchronously.
          • The test waits forever and thus times out.
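          To illustrate the failure mode described above: a wait loop keyed only on
          successfully finished maps (like the check on jip.finishedMaps()) never
          observes a failed attempt, so a single task failure turns into a test
          timeout. A hedged sketch (not part of any attached patch) of a wait that
          consumes task completion events and aborts as soon as an attempt fails,
          using the public RunningJob API rather than the internal JobInProgress
          the test inspects:

              import java.io.IOException;

              import org.apache.hadoop.mapred.RunningJob;
              import org.apache.hadoop.mapred.TaskCompletionEvent;

              public class FailureAwareWait {

                // Wait for the expected number of successful maps, but fail fast if
                // any attempt reports FAILED instead of spinning until the JUnit
                // timeout fires.
                static void waitForMaps(RunningJob job, int expectedMaps)
                    throws IOException, InterruptedException {
                  int nextEvent = 0;
                  int succeeded = 0;
                  while (succeeded < expectedMaps) {
                    for (TaskCompletionEvent event : job.getTaskCompletionEvents(nextEvent)) {
                      nextEvent++;
                      if (event.getTaskStatus() == TaskCompletionEvent.Status.SUCCEEDED) {
                        succeeded++;
                      } else if (event.getTaskStatus() == TaskCompletionEvent.Status.FAILED) {
                        // A check based only on successfully finished maps never
                        // reaches this branch and waits forever.
                        throw new IOException("Map attempt failed: " + event.getTaskAttemptId());
                      }
                    }
                    Thread.sleep(1000);
                  }
                }
              }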

            People

             • Assignee: Sreekanth Ramakrishnan
             • Reporter: Hemanth Yamijala
             • Votes: 0
             • Watchers: 3
