Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Vision:

      We want to build a Simulator to simulate large-scale Hadoop clusters, applications and workloads. This would be invaluable in furthering Hadoop by providing a tool for researchers and developers to prototype features (e.g. pluggable block-placement for HDFS, Map-Reduce schedulers etc.) and predict their behaviour and performance with a reasonable amount of confidence, thereby aiding rapid innovation.


      First Cut: Simulator for the Map-Reduce Scheduler

      The Map-Reduce Scheduler is a fertile area of interest with at least four schedulers, each with their own set of features, currently in existence: Default Scheduler, Capacity Scheduler, Fairshare Scheduler & Priority Scheduler.

      Each scheduler's scheduling decisions are driven by many factors, such as fairness, capacity guarantee, resource availability, data-locality etc.

      Given that, it is non-trivial to accurately choose a single scheduler or even a set of desired features to predict the right scheduler (or features) for a given workload. Hence a simulator which can predict how well a particular scheduler works for some specific workload by quickly iterating over schedulers and/or scheduler features would be quite useful.

      So, the first cut is to implement a simulator for the Map-Reduce scheduler which takes as input a job trace derived from a production workload and a cluster definition, and simulates the execution of the jobs defined in the trace on this virtual cluster. As output, the detailed job execution trace (recorded in relation to virtual simulated time) can then be analyzed to understand various traits of individual schedulers (individual jobs' turnaround times, throughput, fairness, capacity guarantees, etc.). To support this, we need a simulator which can accurately model the conditions of the actual system that affect a scheduler's decisions. These include very large-scale clusters (thousands of nodes), the detailed characteristics of the workload thrown at the clusters, job or task failures, data locality, and cluster hardware (CPU, memory, disk I/O, network I/O, network topology), etc.

      1. mapreduce-728-20090918-6.patch
        844 kB
        Hong Tang
      2. mapreduce-728-20090918-5.patch
        844 kB
        Hong Tang
      3. mapreduce-728-20090918-3.patch
        844 kB
        Hong Tang
      4. mapreduce-728-20090918-2.patch
        842 kB
        Hong Tang
      5. mapreduce-728-20090918.patch
        842 kB
        Hong Tang
      6. mapreduce-728-20090917-4.patch
        842 kB
        Hong Tang
      7. mapreduce-728-20090917-3.patch
        840 kB
        Hong Tang
      8. mapreduce-728-20090917.patch
        157 kB
        Hong Tang
      9. 19-jobs.trace.json.gz
        594 kB
        Hong Tang
      10. 19-jobs.topology.json.gz
        5 kB
        Hong Tang
      11. mumak.png
        44 kB
        Arun C Murthy

        Issue Links

          Activity

          arunkumar added a comment -

          Tamas and Anirban, thanks for the clarification.
          You got my questions right.

          I was also interested in knowing the following:
          > Are there any benchmarks for Mumak?
          > How do I change the preferred locations (apart from manually) in the job trace obtained from Rumen?
          > Can I add some fields to the job trace generated using Rumen?

          Thanks,
          Arun

          Tamas Sarlos added a comment -

          Hi Arun,

          We do not control or simulate anything about the placement of a job's output data. As for the effects of task placement and the input data of a job, we rely on the input split locations recorded in the Hadoop job logs. The locality of the map task's input plays a role in determining the run time of the map task in the simulation. We differentiate among 3 levels of locality based on the closest input split to the task tracker: same node, same rack, and cross rack. We parse the Hadoop job logs using org.apache.hadoop.tools.rumen.

          In detail, with pointers to the code:
          SimulatorTaskTracker.java:738 sets the run time of the task in the
          simulation; the relevant
          SimulatorTaskTracker.SimulatorTaskInProgress.userSpaceRunTime comes from
          org.apache.hadoop.tools.rumen.TaskAttemptInfo,
          org.apache.hadoop.tools.rumen.MapTaskAttemptInfo, and
          org.apache.hadoop.tools.rumen.ReduceTaskAttemptInfo

          These are set by SimulatorJobTracker.java:443
          using SimulatorJobInProgress.getTaskAttemptInfo(taskTracker, taskAttemptID)
          which uses SimulatorJobStory, which is just a wrapper around the
          org.apache.hadoop.tools.rumen.JobStory interface.

          The latter is actually implemented by org.apache.hadoop.tools.rumen.ZombieJob, which uses the logged runtime of task attempts and simple heuristics based on the locality of the taskTracker to the input splits to make up a run time for the task.

          If you wanted to alter the effect of task placement on run time, we suggest modifying the ZombieJob class.
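
          As a concrete illustration (an editorial sketch, not the actual ZombieJob code; the enum, class name and slowdown factors below are made up), the heuristic boils down to scaling the logged runtime by a locality-dependent factor:

              // Hedged sketch of a locality-sensitive runtime model in the
              // spirit of ZombieJob; names and factors are illustrative.
              enum Locality { SAME_NODE, SAME_RACK, CROSS_RACK }

              final class MapRuntimeModel {
                private static final double SAME_RACK_FACTOR  = 1.3; // assumed slowdown
                private static final double CROSS_RACK_FACTOR = 1.8; // assumed slowdown

                /** Simulated runtime from the logged runtime and the locality of
                 *  the closest input split to the chosen task tracker. */
                long makeUpRuntime(long loggedRuntimeMs, Locality locality) {
                  switch (locality) {
                    case SAME_NODE: return loggedRuntimeMs;
                    case SAME_RACK: return (long) (loggedRuntimeMs * SAME_RACK_FACTOR);
                    default:        return (long) (loggedRuntimeMs * CROSS_RACK_FACTOR);
                  }
                }
              }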

          Please let us know if we understood your question correctly.

          Best,
          Tamas and Anirban

          arunkumar added a comment -

          A few clarification questions:
          Q> How does Mumak place the per-job data on the simulated nodes? I am interested in controlling the placement of every job's data from the job trace.
          Q> Which classes do I need to modify, and what has to be done for this?

          Chris Douglas added a comment -

          Documentation is being added in MAPREDUCE-1056

          Matei Zaharia added a comment -

          Is there any documentation on Mumak, or is this JIRA the current place to find out about it?

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk-Commit #64 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/64/)
          . Add Mumak, a Hadoop map/reduce simulator. Contributed by Arun C Murthy,
          Tamas Sarlos, Anirban Dasgupta, Guanying Wang, and Hong Tang

          Chris Douglas added a comment -

          +1

          I committed this to trunk and the 0.21 branch, per the vote on mapreduce-dev.

          Thanks to Arun Murthy, Tamas Sarlos, Anirban Dasgupta, Guanying Wang, and Hong Tang.

          Hong Tang added a comment -

          The failed test org.apache.hadoop.tools.TestCopyFiles.testHftpAccessControl (from TestCopyFiles) is not related.

          Chris Douglas added a comment -

          The failing core test, TestCopyFiles.testHftpAccessControl, also fails on trunk (MAPREDUCE-1029)

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12420122/mapreduce-728-20090918-6.patch
          against trunk revision 818577.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 30 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/129/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/129/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/129/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/129/console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12420122/mapreduce-728-20090918-6.patch
          against trunk revision 817740.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 30 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/124/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/124/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/124/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/124/console

          This message is automatically generated.

          Hong Tang added a comment -

          test-patch passed on both trunk and hadoop-0.21 branch on my local machine.

          Hong Tang added a comment -

          test-patch, contrib-tests passed on my local machine.

          Hong Tang added a comment -

          Fixed problem introduced by MAPREDUCE-980.

          Hong Tang added a comment -

          Given that it was broken by the very late check-in of MAPREDUCE-980, I'd like to request an extension for this patch to go into 0.21.

          Chris Douglas added a comment -

          MAPREDUCE-980 broke the patch in an unrecoverable way. This cannot make feature freeze.

          Hong Tang added a comment -

          Patch adds back some changes that were lost between 20090918 and 20090917-4.

          Chris Douglas added a comment -
               [exec] -1 overall.
               [exec]
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec]
               [exec]     +1 tests included.  The patch appears to include 30 new or modified tests.
               [exec]
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec]
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec]
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec]
               [exec]     -1 release audit.  The applied patch generated 180 release audit warnings (more than the trunk's current 176 warnings).
          

          These files need license headers:

          src/contrib/mumak/src/java/org/apache/hadoop/mapred/SimulatorClock.java
          src/contrib/mumak/src/java/org/apache/hadoop/mapred/SimulatorJobStory.java
          

          The two .gz files don't need license headers, of course.

          Hong Tang added a comment -

          New patch that resolves a conflict in src/contrib/build-contrib.xml.

          Hong Tang added a comment -

          Patch that fixes problems caused by API changes in MAPREDUCE-777.

          Hong Tang added a comment -

          Patch broken due to MAPREDUCE-777.

          Konstantin Boudnik added a comment -

          Since this exposes some private JobTracker/JobInProgress fields to subclasses, any fields that can be made final, should be.

          The copied JobTracker constructor is regrettable

          I believe both issues could be solved through appropriate use of AspectJ. For the first case one might use privileged aspects to access private members via generated getters.

          The second one might be resolved by an aspect-generated constructor in the appropriate class.

          Hong Tang added a comment -

          The attached patch is the first step toward the many things it could enable us to do. The following follows up on previous comments in this jira.

          I have one item of high-level feedback. It looks like Mumak has two components - a simulator and a trace-driven workload generator. It would be nice if the workload generator was pluggable so that the simulator could be used on synthetic workloads without requiring a trace. For example, one should be able to create a simulated cluster where some given node is always slow, or fails partway through, etc. Then the simulator could be used in unit tests, simplifying a lot of the testing code in various schedulers.

          In the patch, this is very close to what we did (after MAPREDUCE-966). The dependency between Mumak and Rumen (the load generator) comes down to a few interfaces: JobStory, JobStoryProducer, and ClusterStory. Currently JobStoryProducer maps to SimulatorJobStory, and ClusterStory maps to ZombieCluster. It should be very easy to make them pluggable, and I will create a Jira to track this.
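
          To make the pluggability concrete, here is a minimal editorial sketch of a synthetic workload source, assuming JobStoryProducer is essentially an iterator over JobStory objects (method names approximate the interface and may differ from the committed API):

              import java.io.IOException;
              import java.util.Iterator;
              import java.util.List;
              import org.apache.hadoop.tools.rumen.JobStory;
              import org.apache.hadoop.tools.rumen.JobStoryProducer;

              // Feeds fabricated jobs to the simulator instead of a Rumen trace.
              public class SyntheticJobStoryProducer implements JobStoryProducer {
                private final Iterator<JobStory> jobs;

                public SyntheticJobStoryProducer(List<JobStory> fabricated) {
                  this.jobs = fabricated.iterator();
                }

                @Override
                public JobStory getNextJob() throws IOException {
                  return jobs.hasNext() ? jobs.next() : null; // null ends the workload
                }

                @Override
                public void close() throws IOException {
                  // nothing to release for an in-memory workload
                }
              }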

          Then the simulator could be used in unit tests, simplifying a lot of the testing code in various schedulers.

          Yes, the unit tests included were written in this way. And they exposed two possible bugs in recent JobHistory changes from MAPREDUCE-157 (MAPREDUCE-995 and MAPREDUCE-1000). This is actually a pleasant surprise to me: we were thinking of using Mumak for design proof or performance validation, but our design choice to use the actual scheduler code and JT code also makes it a JT debugger.

          What will be done about speculative tasks?
          Will Mumak simulate high-memory jobs?

          Neither is done in this patch. But I agree these are things we should simulate in follow-up improvements.

          The schedulers and the JobTracker currently have some threads that perform an operation periodically and sleep in-between doing so.

          We (partially) solve the problem by using AspectJ so that the threads become no-ops. The reason I say it is a partial solution is that the threads are still active, and the interception logic is not something you can determine mechanically, though in most cases it is very straightforward to identify.
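
          To illustrate the technique (an editorial sketch in annotation-style AspectJ; the intercepted inner class below is hypothetical, and the patch picks its own join points case by case):

              import org.aspectj.lang.ProceedingJoinPoint;
              import org.aspectj.lang.annotation.Around;
              import org.aspectj.lang.annotation.Aspect;

              // Turns a periodic background thread's body into a no-op so the
              // discrete-event engine fully controls the flow of time.
              @Aspect
              public class NoOpPeriodicThreads {
                @Around("execution(void org.apache.hadoop.mapred.JobTracker.ExpireTrackers.run())")
                public Object skipRun(ProceedingJoinPoint pjp) {
                  return null; // never calls pjp.proceed(): the thread exits immediately
                }
              }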

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12419962/mapreduce-728-20090917-4.patch
          against trunk revision 816476.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 30 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/102/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/102/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/102/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/102/console

          This message is automatically generated.

          Hong Tang added a comment -

          test-patch passed on my local machine now:

               [exec] +1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 30 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          
          Hong Tang added a comment -

          Fixed release audit warnings.

          Hong Tang added a comment -

          Need to apply patch from MAPREDUCE-995 before testing this patch.

          Hong Tang added a comment -

          Incorporated comments from Chris. Also used the "git diff --text" trick to include the binary .gz files in the patch.

          Chris Douglas added a comment -

          (unit tests, etc. being prepared)

          • Since this exposes some private JobTracker/JobInProgress fields to subclasses, any fields that can be made final, should be. Also, there doesn't seem to be a consistent protected/package-private strategy for these members. Since SimulatorJobTracker is in the same package, most should be package-private.
          • The copied JobTracker constructor is regrettable, but reworking the code to support this is a large undertaking that can be postponed. The same can be said of SimulatorJobInProgress. A JIRA tracking a cleanup of these APIs would be appropriate.

          I have not stepped through the details of the simulator code.

          Konstantin Boudnik added a comment -

          I'd like to point out that the AspectJ-related modifications of the build file are likely to change significantly as soon as HADOOP-6204 is completed and committed.
          +1 on using AspectJ smartly!

          Hong Tang added a comment -

          To review/commit the patch, follow the steps below:

          • apply patch http://issues.apache.org/jira/secure/attachment/12419875/mapred-995-v1.patch
          • apply patch mapreduce-728-20090917.patch
          • download the two json.gz files and store them under src/contrib/mumak/src/test/data
          • "ant jar tools"
          • "cd src/contrib/mumak && ant test"
          Hong Tang added a comment -

          After applying the patch uploaded on Jira MAPREDUCE-995, all Mumak unit tests pass. Also ran findbugs and run-commit-tests, both pass.

          Hong Tang added a comment -

          These two trace files should go to src/contrib/mumak/src/test/data.

          Hong Tang added a comment -

          Ok, this patch is close to the final form. The unit tests currently fail due to MAPREDUCE-995.

          Also, the end-to-end unit test requires two trace files in gzip format. I will upload them separately.

          Matei Zaharia added a comment -

          Oh, never mind. I was looking at a CapacityScheduler from an older Hadoop version in Eclipse for some reason.

          Matei Zaharia added a comment -

          Sorry, I just did a svn update and I see a reclaimCapacityThread created at line 1145 of CapacityTaskScheduler.java. Is this change just not committed yet?

          Hemanth Yamijala added a comment -

          Arun / Matei,

          The trunk version of Capacity scheduler does not have a thread to reclaim capacities. But, as Matei pointed out, we do have a thread for job initialization.

          Matei Zaharia added a comment -

          Forgot to add, both schedulers also have a thread or a thread pool for initializing jobs, but those should be easy to mock out.

          Matei Zaharia added a comment -

          Agreed. I know for sure that neither the default nor the capacity scheduler uses threads; what about fair-share? How hard is it to stop using threads there?

          The capacity scheduler has a thread for reclaiming capacity (at least in my version of trunk that I've checked out).

          It would be easy to get rid of this thread and the ones in the fair scheduler by either having a schedulePeriodicTask API in the MapReduceMaster interface or perhaps by checking the current time in assignTasks() and running any computations whose interval has passed. I'm fine with either solution.
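
          A minimal editorial sketch of the second option, checking the clock inside assignTasks(); Clock, Task and TaskTrackerStatus stand in for the real Hadoop types, and the names and the 500 ms period are made up:

              import java.util.List;

              // Fold the periodic thread's work into assignTasks(), which is
              // invoked regularly anyway, so it runs in simulated time too.
              abstract class PeriodicWorkScheduler {
                private final Clock clock; // virtual clock under Mumak
                private long lastUpdate = 0L;
                private static final long UPDATE_INTERVAL_MS = 500L;

                PeriodicWorkScheduler(Clock clock) { this.clock = clock; }

                List<Task> assignTasks(TaskTrackerStatus tracker) {
                  long now = clock.getTime();
                  if (now - lastUpdate >= UPDATE_INTERVAL_MS) {
                    recomputeState(); // work the background thread used to do
                    lastUpdate = now;
                  }
                  return pickTasks(tracker);
                }

                abstract void recomputeState();
                abstract List<Task> pickTasks(TaskTrackerStatus tracker);
              }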

          Arun C Murthy added a comment -

          Clarification question: If the per-job digest from the "source" workload is generated by a real Hadoop cluster, then that workload would be an artifact of all the pluggable components used in the cluster used to generate it?

          The characteristics of the workload (e.g. for a given job j, runtime for data-local maps, off-rack maps etc.) are reasonably independent of the Scheduler in question.

          For instance, if I had a cluster running the default scheduler, and I took those job digests and throw those into Mumak, then the question that Mumak is supposed to answer is, given a workload of a set of jobs under the Default Scheduler, how different would the execution times be under some different set of pluggable components?

          Yes. With 'pluggable components' limited to the Scheduler.

          Crucially, an equally important (if not more so) role of Mumak is to help us answer the question: What will the turnaround of the same workload be if we added a feature 'X' to the same scheduler 'Y'?

          Arun C Murthy added a comment -

          I have one item of high-level feedback. It looks like Mumak has two components - a simulator and a trace-driven workload generator. It would be nice if the workload generator was pluggable so that the simulator could be used on synthetic workloads without requiring a trace.

          The proposal is for Mumak to work with Rumen (whose jira is coming along soon), which exposes the necessary APIs to let us 'query' Rumen for characteristics of the workload (e.g. for a given job j, how long did a data-local map-task take? or an off-rack one?). So, yes, you could seed Rumen with a synthetic trace and run Mumak against it.

          What will be done about speculative tasks?

          For V1 we plan to ignore speculation. It is considerably harder to simulate per-task progress and thus the plan is to push it to a future release.

          Will Mumak simulate high-memory jobs?

          Yes!

          The schedulers and the JobTracker currently have some threads that perform an operation periodically and sleep in-between doing so. To make these work in a simulator, I think we have to make these pieces of code not use threads [...]

          Agreed. I know for sure that neither the default nor the capacity scheduler uses threads; what about fair-share? How hard is it to stop using threads there?

          Calls to System.currentTimeMillis will have to be replaced by use of Clock throughout the schedulers.

          +1
          As you'll see when we put up our work, we use a 'virtual time' throughout Mumak, which we will use to seed JobTracker.clock.
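
          For concreteness, a minimal editorial sketch of such a virtual clock, assuming a Clock base class whose only contract is getTime() (the patch's SimulatorClock plays this role; this rendition is illustrative):

              // A clock that only moves when the simulator engine advances it.
              class VirtualClock extends Clock {
                private long now; // current virtual time in ms

                VirtualClock(long startTime) { this.now = startTime; }

                @Override
                long getTime() { return now; }

                void advance(long toTime) {
                  if (toTime < now) throw new IllegalArgumentException("time went backwards");
                  now = toTime;
                }
              }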

          Jiaqi Tan added a comment -

          Clarification question: If the per-job digest from the "source" workload is generated by a real Hadoop cluster, then that workload would be an artifact of all the pluggable components used in the cluster used to generate it? For instance, if I had a cluster running the default scheduler, and I took those job digests and threw those into Mumak, then the question that Mumak is supposed to answer is: given a workload of a set of jobs under the Default Scheduler, how different would the execution times be under some different set of pluggable components? I'm trying to understand what Mumak is meant to do.

          Matei Zaharia added a comment -

          This is looking good!

          I have one item of high-level feedback. It looks like Mumak has two components - a simulator and a trace-driven workload generator. It would be nice if the workload generator was pluggable so that the simulator could be used on synthetic workloads without requiring a trace. For example, one should be able to create a simulated cluster where some given node is always slow, or fails partway through, etc. Then the simulator could be used in unit tests, simplifying a lot of the testing code in various schedulers.

          Also, some questions about things that will be difficult to simulate:

          • What will be done about speculative tasks? The trace currently shows a second attempt being started and a first being killed. One option would be to make the first attempt take forever, but then you'd have to decide when to mark the task as speculatable in the simulated JobInProgress. Another option might be to always use the time of the fastest non-killed task attempt and forget about speculation in V1.
          • Will Mumak simulate high-memory jobs? That's one of the more interesting scheduling problems.
          • The schedulers and the JobTracker currently have some threads that perform an operation periodically and sleep in-between doing so. To make these work in a simulator, I think we have to make these pieces of code not use threads, and include an API in the JobTracker such as schedulePeriodically(Runnable runnable, long interval) so that these threads can run in simulated time.
          • Calls to System.currentTimeMillis will have to be replaced by use of Clock throughout the schedulers.
          Matei Zaharia added a comment -

          I'll take a look at the design in more detail. I think this would be really great to have for unit tests too.

          Arun C Murthy added a comment -

          I hasten to add that we are also working on a tool (internally we call it rumen) which will be tasked with gleaning information required for Mumak given a workload (i.e. a list of job-history files for starters).

          Arun C Murthy added a comment -

          Matei, we are proposing Mumak as a contrib module. We would love to collaborate. An explicit goal of Mumak is to work with the existing Hadoop schedulers.

          Arun C Murthy added a comment -

          We propose to develop/contribute Mumak as a contrib module.

          Arun C Murthy added a comment -

          Proposed architecture for Mumak.

          Matei Zaharia added a comment -

          Arun, is this something you are developing / have developed at Yahoo, or is it more of a wish?

          I have a simple Ruby simulator for MR, and we are building a more detailed one for Nexus. However, neither of these runs existing Hadoop schedulers. If you think a Ruby simulator would be useful though, perhaps we can contribute one of these separately.

          Arun C Murthy added a comment -

          Mumak 1.0

          The goal is to build a discrete event simulator to simulate conditions under which a Hadoop Map-Reduce Scheduler performs on a large-scale Map-Reduce cluster running a specific workload.

          Mumak takes as input a reasonably large workload (e.g. a month's worth of jobs from production cluster(s)) and simulates them in a matter of hours if not minutes on very few machines.

          What is it not?

          It is a non-goal to simulate the actual map/reduce tasks themselves.

          The scope of Version 1.0 does not include the specifics of trying to simulate the actual workload itself. It will merely take a digest of the Hadoop Map-Reduce JobHistory of all jobs in the workload, and faithfully assume the actual run-times of individual tasks from the digest without simulating the tasks themselves. Clearly it will not try to simulate resources and their utilization on the actual tasktrackers, interactions between running tasks on the tasktrackers, etc. The simulation of individual tasks is left for future versions.

          Some other simplifications are also made (mainly due to the lack of such information in the job trace):

          • No job dependency. Jobs are faithfully submitted to the cluster as defined in the job trace.
          • No modeling of failure correlations (e.g. a few task attempts fail due to a node failure, but in the simulation run the same set of task attempts may run on different nodes).

          What goes in? What comes out?

          The 'workload' alluded to in the previous sections needs elaboration. The proposal is to use the job-history for all jobs which are part of the workload. The Hadoop Map-Reduce per-job job-history is a very detailed log of each component task with run-times, counters etc. We can use this to generate a per-job digest with all relevant information. Thus, it is quite sufficient and feasible to collect workload from different clusters (research, production etc.) to be then used during simulation.

          More specifically, the following is a list of details it simulates:

          • It would simulate a cluster of the same size and network topology as where the source trace comes from. The reason for this restriction is that data locality is an important factor in scheduler decisions, and scaling job traces obtained from cluster A to simulate them on a cluster B with a different diameter requires a much more thorough understanding.
          • It would simulate failures faithfully as recorded in the job trace. Namely, if a particular task attempt fails in the trace, it would fail in the simulation.
          • It would replay the same number of map tasks and reduce tasks as specified in the job digest.
          • It would use the input split locations as recorded in the job trace.

          The simulator will generate job-history in the same format for each of the simulated jobs. Thus we can use the same tools for slicing and dicing the output of the simulator.

          Design & Architecture

          Design Goals

          An overarching design goal for Mumak is that we should be able to use the exact same Map-Reduce Schedulers (listed above) as-is without any changes. This implies that we use the same interfaces used by Hadoop Map-Reduce so that it is trivial to plug-in the Scheduler of interest.

          Along the same lines, it is a legitimate goal to use all relevant Hadoop Map-Reduce interfaces between the various components so that it is trivial to replace each with the appropriate Hadoop Map-Reduce component (e.g. to run the simulator in an emulation mode with real Map-Reduce clusters in the future).

          Architecture

          Mumak consists of the following components:

          • Discrete Event Simulator Engine with an event-queue
          • Simulated JobTracker
          • Simulated Cluster (set of tasktrackers)
          • Client for handling job-submission

          Engine

          The Simulator Engine is the heart of Mumak. It manages all the discrete events in virtual time and fires the appropriate handlers (JobClient, TaskTracker) when the events occur. Typically, each event responded to by a component results in a new set of events to be fired in the future (in virtual time); a sketch of this loop follows the event list below.

          Some of the various event-types are:

          • HeartbeatEvent - An event which instructs a specific Tasktracker to send a heartbeat to the JobTracker.
          • TaskCompletionEvent - An event which denotes the completion (success/failure) of a specific map or reduce task which is sent to the TaskTracker.
          • JobSubmissionEvent - An event which instructs the JobClient to submit a specific job to the JobTracker.
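
          Putting the engine and the event types together, a minimal editorial sketch of the loop, reusing the VirtualClock sketch above (SimEvent, its listener type and the method names are illustrative, not the exact patch API):

              import java.util.List;
              import java.util.PriorityQueue;

              // Discrete-event loop: pop the earliest event, jump the virtual
              // clock to its timestamp, let the addressee handle it, and
              // enqueue any follow-up events it produces.
              class SimulatorEngine {
                // SimEvent is assumed Comparable by its virtual timestamp.
                private final PriorityQueue<SimEvent> queue = new PriorityQueue<SimEvent>();
                private final VirtualClock clock;

                SimulatorEngine(VirtualClock clock) { this.clock = clock; }

                void submit(SimEvent event) { queue.add(event); }

                void run(long endTime) {
                  while (!queue.isEmpty() && queue.peek().getTime() <= endTime) {
                    SimEvent event = queue.poll();
                    clock.advance(event.getTime());
                    List<SimEvent> followUps = event.getListener().accept(event);
                    queue.addAll(followUps); // e.g. a tracker's next HeartbeatEvent
                  }
                }
              }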

          Simulated JobTracker

          The JobTracker is the driver for the Map-Reduce Scheduler. On receipt of heartbeats from the various TaskTrackers it 'tracks' the progress of the current jobs and forwards the appropriate information to the Scheduler to allow it to make task-scheduling decisions. The simulated JobTracker uses the virtual time to allow the scheduler to make scheduling decisions.

          The JobTracker also uses the per-job digest to fill in information about the expected runtime for each of the tasks scheduled by the Scheduler, letting Mumak simulate run-times for each task.

          The JobTracker is purely reactive in the sense that it only reacts to heartbeats sent by TaskTrackers. Furthermore, it does not directly handle any events from the Engine; it only responds to InterTrackerProtocol.heartbeat calls, as in the real world.

          Simulated Cluster

          The simulated cluster consists of an appropriate number of simulated TaskTrackers which respond to events generated by Engine. Each simulated TaskTracker maintains state about currently running tasks (all tasks are 'running' till an appropriate TaskCompletionEvent fires) and sends periodic status updates to the JobTracker on receipt of HeartbeatEvent.

          HeartbeatEvent

          When a HeartbeatEvent fires, the appropriate TaskTracker builds status-reports for each of the running tasks and sends a heartbeat to the JobTracker (InterTrackerProtocol.heartbeat). The JobTracker updates its data-structures (JobInProgress, TaskInProgress etc.) to reflect the latest state and forwards the information to the Scheduler. If any new tasks are to be scheduled on this TaskTracker, the JobTracker also fills in expected run-times for each via information gleaned from the job-digest. The TaskTracker then processes the instructions to launch the new tasks and responds to the Engine by inserting a set of new TaskCompletionEvents for the new tasks into the EventQueue.

          TaskCompletionEvent

          When a TaskCompletionEvent fires, the appropriate TaskTracker marks the relevant task as complete and forwards that information to the JobTracker on the next HeartbeatEvent.
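
          Continuing the illustrative engine sketch above, a simulated TaskTracker's handling of these two events might look roughly as follows. The scheduler call and digest lookup are stubbed, and none of these names are Mumak's actual APIs:

            import java.util.ArrayList;
            import java.util.List;

            class HeartbeatEvent extends SimEvent {
              HeartbeatEvent(long time, EventListener listener) { super(time, listener); }
            }

            class TaskCompletionEvent extends SimEvent {
              final String taskId;
              TaskCompletionEvent(long time, EventListener listener, String taskId) {
                super(time, listener);
                this.taskId = taskId;
              }
            }

            class SimulatedTaskTracker implements EventListener {
              private static final long HEARTBEAT_INTERVAL = 3_000; // virtual ms
              private final List<String> runningTasks = new ArrayList<>();

              public List<SimEvent> accept(SimEvent event) {
                List<SimEvent> followUps = new ArrayList<>();
                long now = event.timestamp;
                if (event instanceof HeartbeatEvent) {
                  // Build status-reports for running tasks and heartbeat to the
                  // JobTracker (InterTrackerProtocol.heartbeat in the real system;
                  // stubbed here). Each newly assigned task is modelled as
                  // completing after the expected runtime from the job digest.
                  for (String task : tasksAssignedByScheduler()) {
                    runningTasks.add(task);
                    followUps.add(new TaskCompletionEvent(
                        now + expectedRuntimeFromDigest(task), this, task));
                  }
                  // Schedule this tracker's next heartbeat.
                  followUps.add(new HeartbeatEvent(now + HEARTBEAT_INTERVAL, this));
                } else if (event instanceof TaskCompletionEvent) {
                  // Mark the task complete; its final status is reported to the
                  // JobTracker on the next heartbeat.
                  runningTasks.remove(((TaskCompletionEvent) event).taskId);
                }
                return followUps;
              }

              // Stub: in Mumak the JobTracker's Scheduler makes this decision.
              private List<String> tasksAssignedByScheduler() { return new ArrayList<>(); }

              // Stub: in Mumak this value comes from the per-job digest.
              private long expectedRuntimeFromDigest(String task) { return 60_000; }
            }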

          Simulated JobClient

          The JobClient responds to JobSubmissionEvents sent by the Engine and submits the appropriate jobs to the JobTracker via the standard JobSubmissionProtocol.

          Relevant Details

          Job Summary for Simulation

          The following can be derived from the job-history files by Rumen:

          • Detailed job trace with properties and counters of each task attempt (of each task of each job in a workload).
          • Digest of jobs in a workload. From the jobs in the workload, we can derive statistical information about tasks to build a model which can help us fabricate tasks that were never even scheduled to run (e.g. tasks of a failed job which never ran because the job was declared FAILED soon after submission). Along the same lines, the digest will also carry statistical details to help model run-times for data-local, rack-local and off-rack maps, based on data in the job-history logs. This is necessary for simulating tasks which the scheduler might place on different nodes during the simulation run (a sketch of such a locality-based lookup follows this list).
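
          As an illustration of how such a digest might be consulted, the sketch below chooses a map run-time based on the locality the scheduler actually achieved in the simulation. The digest layout shown is a simplifying assumption for this write-up, not Rumen's real output format:

            import java.util.EnumMap;
            import java.util.Map;

            enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

            // Hypothetical digest entry: mean map run-times (in virtual ms) keyed
            // by achieved data locality, derived from the job-history logs.
            class JobDigest {
              private final Map<Locality, Long> meanMapRuntime =
                  new EnumMap<>(Locality.class);

              JobDigest(long dataLocal, long rackLocal, long offRack) {
                meanMapRuntime.put(Locality.DATA_LOCAL, dataLocal);
                meanMapRuntime.put(Locality.RACK_LOCAL, rackLocal);
                meanMapRuntime.put(Locality.OFF_RACK, offRack);
              }

              // The scheduler may place a map on a different node than the trace
              // did, so the run-time is chosen by the locality achieved in the
              // simulation, not the locality recorded in the trace.
              long expectedMapRuntime(Locality achieved) {
                return meanMapRuntime.get(achieved);
              }
            }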

          How to deal with failures in the workload?

          We will try to faithfully model task failures by replaying failed task-attempts using information from the detailed job-traces.

          We also plan to build a simple statistical model of task failures which can then be used to simulate tasks which were never scheduled since the job failed early etc.

          Simulating Reduce Tasks

          In Mumak 1.0 we do not plan to simulate the running of the actual map/reduce tasks. Given that, it is not possible to simulate the implicit dependency between the completion of maps, the shuffle phase and the start of the reduce-phase of the reduce tasks. Hence, we have decided to use a special AllMapsFinished event, generated by the SimulatedJobTracker, to trigger the start of the reduce-phase. For the same reason, we model the total runtime of a reduce task as the sum of the time taken for all maps to complete and the time taken for the individual task to complete the reduce-phase by itself. Thus, we are not going to try to model the shuffle phase accurately.
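
          As a back-of-the-envelope illustration of this model (all numbers below are invented):

            public class ReduceRuntimeModel {
              public static void main(String[] args) {
                long reduceLaunchTime    =    10_000; // virtual ms, invented
                long allMapsFinishedTime = 1_200_000; // when AllMapsFinished fires
                long reducePhaseRuntime  =   300_000; // taken from the job digest

                // The reduce finishes one reduce-phase after the last map
                // completes; the shuffle is deliberately not modelled separately.
                long reduceFinishTime = allMapsFinishedTime + reducePhaseRuntime;
                System.out.println("simulated reduce runtime = "
                    + (reduceFinishTime - reduceLaunchTime) + " ms");
              }
            }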

          Furthermore, we will ignore map-task failures due to failed shuffles since we are not simulating the shuffle-phase.


          Thoughts?


            People

            • Assignee:
              Hong Tang
              Reporter:
              Arun C Murthy
            • Votes:
              0
              Watchers:
              39
