Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Labels:
      None
    • Release Note:
      MapReduce has undergone a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2).

      The fundamental idea of MRv2 is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
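      The split described above can be sketched as a minimal toy model: one global ResourceManager tracking per-node capacity, with NodeManagers reporting slots. The class and field names here are illustrative only, not the actual YARN classes or APIs.

```python
from dataclasses import dataclass, field

# Toy model of the MRv2 split: a single global ResourceManager and one
# NodeManager per node. Illustrative names only, not real YARN classes.

@dataclass
class NodeManager:
    node: str
    free_slots: int  # simplified: containers available on this node

@dataclass
class ResourceManager:
    nodes: dict = field(default_factory=dict)  # node name -> NodeManager

    def register(self, nm: NodeManager) -> None:
        self.nodes[nm.node] = nm

    def allocate(self, n: int) -> list:
        """Grant up to n containers across the cluster. The RM only
        arbitrates capacity; it does not track application status."""
        granted = []
        for nm in self.nodes.values():
            while nm.free_slots > 0 and len(granted) < n:
                nm.free_slots -= 1
                granted.append(nm.node)
        return granted

rm = ResourceManager()
rm.register(NodeManager("node-1", free_slots=2))
rm.register(NodeManager("node-2", free_slots=1))
print(rm.allocate(2))
```

      A per-application ApplicationMaster would call something like `allocate` repeatedly and launch its tasks in the granted containers.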

      The ResourceManager has two main components:
      * Scheduler (S)
      * ApplicationsManager (ASM)

      The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees on restarting failed tasks, whether due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates elements such as memory, CPU, disk and network.
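      The multi-dimensional Resource Container notion can be illustrated with a toy sketch: a request fits on a node only if it fits in every resource dimension. The `Resource` type and its fields are hypothetical, not the actual YARN Resource API.

```python
from dataclasses import dataclass

# Toy illustration of the abstract "Resource Container": a request is a
# vector of resource dimensions, and it fits on a node only if every
# dimension fits. Hypothetical names, not the actual YARN API.

@dataclass
class Resource:
    memory_mb: int
    vcores: int

    def fits_in(self, available: "Resource") -> bool:
        return (self.memory_mb <= available.memory_mb
                and self.vcores <= available.vcores)

node_free = Resource(memory_mb=4096, vcores=2)
request = Resource(memory_mb=1024, vcores=1)
assert request.fits_in(node_free)
```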

      The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers, such as the CapacityScheduler and the FairScheduler, are examples of such plug-ins.

      The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources.
      The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

      The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

      The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring their progress.
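      The ApplicationMaster's negotiate-launch-monitor loop can be sketched as follows. The scheduler here is a stub that grants a bounded number of containers per request; all names are illustrative, not actual YARN APIs.

```python
# Sketch of the per-application ApplicationMaster's responsibility:
# repeatedly ask the Scheduler for containers, run one pending task per
# granted container, and track completion. Stub scheduler; toy model only.

def scheduler_allocate(wanted: int) -> list:
    """Stub Scheduler: grants at most 2 containers per request."""
    return [f"container-{i}" for i in range(min(wanted, 2))]

def application_master(num_tasks: int) -> list:
    pending = list(range(num_tasks))
    completed = []
    while pending:
        granted = scheduler_allocate(len(pending))
        for container in granted:
            task = pending.pop(0)
            # here the AM would launch the task in the container and
            # monitor it; we simply mark it complete
            completed.append(task)
    return completed

assert application_master(5) == [0, 1, 2, 3, 4]
```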
    • Tags:
      mr2,mapreduce-2.0

      Description

      Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution.

      1. MR-279.patch
        3.66 MB
        Arun C Murthy
      2. MR-279.sh
        0.8 kB
        Arun C Murthy
      3. MR-279_MR_files_to_move.txt
        23 kB
        Arun C Murthy
      4. MR-279.patch
        3.94 MB
        Arun C Murthy
      5. capacity-scheduler-dark-theme.png
        192 kB
        Luke Lu
      6. multi-column-stable-sort-default-theme.png
        299 kB
        Luke Lu
      7. yarn-state-machine.job.dot
        2 kB
        Greg Roelofs
      8. yarn-state-machine.task-attempt.dot
        3 kB
        Greg Roelofs
      9. yarn-state-machine.task.dot
        2 kB
        Greg Roelofs
      10. yarn-state-machine.job.png
        23 kB
        Greg Roelofs
      11. yarn-state-machine.task-attempt.png
        25 kB
        Greg Roelofs
      12. yarn-state-machine.task.png
        18 kB
        Greg Roelofs
      13. hadoop_contributors_meet_07_01_2011.pdf
        531 kB
        Sharad Agarwal
      14. MapReduce_NextGen_Architecture.pdf
        554 kB
        Arun C Murthy
      15. MR-279-script.sh
        2 kB
        Mahadev konar
      16. MR-279_MR_files_to_move.txt
        23 kB
        Mahadev konar
      17. post-move.patch
        83 kB
        Mahadev konar
      18. post-move.patch
        85 kB
        Arun C Murthy
      19. MR-279-script.sh
        3 kB
        Arun C Murthy
      20. post-move.patch
        99 kB
        Arun C Murthy
      21. MR-279-script-20110817.sh
        3 kB
        Vinod Kumar Vavilapalli
      22. MR-279_MR_files_to_move-20110817.txt
        23 kB
        Vinod Kumar Vavilapalli
      23. post-move-patch-20110817.2.txt
        126 kB
        Vinod Kumar Vavilapalli
      24. post-move-patch-final.txt
        131 kB
        Mahadev konar
      25. MR-279-script-final.sh
        3 kB
        Arun C Murthy
      26. ResourceManager.png
        290 kB
        Binglin Chang
      27. ResourceManager.gv
        6 kB
        Binglin Chang
      28. NodeManager.gv
        6 kB
        Binglin Chang
      29. NodeManager.png
        228 kB
        Binglin Chang

        Issue Links

          Activity

          Gavin made changes -
          Link This issue is depended upon by MAPREDUCE-297 [ MAPREDUCE-297 ]
          Gavin made changes -
          Link This issue blocks MAPREDUCE-297 [ MAPREDUCE-297 ]
          Harsh J made changes -
          Link This issue incorporates MAPREDUCE-4345 [ MAPREDUCE-4345 ]
          Harsh J made changes -
          Link This issue is duplicated by HADOOP-3444 [ HADOOP-3444 ]
          Harsh J made changes -
          Link This issue is duplicated by HADOOP-3421 [ HADOOP-3421 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Arun C Murthy made changes -
          Release Note updated
          Arun C Murthy made changes -
          Release Note updated
          Description changed from:
          "Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution.

          Check it out by following [the instructions|http://goo.gl/rSJJC]."
          to:
          "Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution."

          Binglin Chang made changes -
          Attachment NodeManager.gv [ 12492877 ]
          Attachment NodeManager.png [ 12492878 ]
          Binglin Chang made changes -
          Attachment ResourceManager.png [ 12492875 ]
          Attachment ResourceManager.gv [ 12492876 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to PIG-2125 [ PIG-2125 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAPREDUCE-2720 [ MAPREDUCE-2720 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAPREDUCE-2719 [ MAPREDUCE-2719 ]
          Vinod Kumar Vavilapalli made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Arun C Murthy made changes -
          Attachment MR-279-script-final.sh [ 12490713 ]
          Mahadev konar made changes -
          Attachment post-move-patch-final.txt [ 12490696 ]
          Vinod Kumar Vavilapalli made changes -
          Attachment MR-279-script-20110817.sh [ 12490651 ]
          Attachment MR-279_MR_files_to_move-20110817.txt [ 12490652 ]
          Attachment post-move-patch-20110817.2.txt [ 12490653 ]
          Arun C Murthy made changes -
          Attachment post-move.patch [ 12490592 ]
          Arun C Murthy made changes -
          Attachment post-move.patch [ 12490588 ]
          Attachment MR-279-script.sh [ 12490589 ]
          Arun C Murthy made changes -
          Assignee Arun C Murthy [ acmurthy ]
          Mahadev konar made changes -
          Attachment MR-279-script.sh [ 12490575 ]
          Attachment MR-279_MR_files_to_move.txt [ 12490576 ]
          Attachment post-move.patch [ 12490577 ]
          Arun C Murthy made changes -
          Attachment MapReduce_NextGen_Architecture.pdf [ 12486023 ]
          Josh Wills made changes -
          Link This issue relates to MAPREDUCE-2644 [ MAPREDUCE-2644 ]
          Sharad Agarwal made changes -
          Attachment hadoop_contributors_meet_07_01_2011.pdf [ 12485267 ]
          Josh Wills made changes -
          Link This issue relates to MAPREDUCE-2639 [ MAPREDUCE-2639 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAPREDUCE-2630 [ MAPREDUCE-2630 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to MAPREDUCE-2633 [ MAPREDUCE-2633 ]
          Luke Lu made changes -
          Tags mr2,mapreduce-2.0
          Description changed from:
          "Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution."
          to:
          "Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution.

          Check it out by following [the instructions|http://goo.gl/rSJJC]."
          Component/s mrv2 [ 12314301 ]
          Component/s tasktracker [ 12312906 ]
          Component/s jobtracker [ 12312907 ]
          Greg Roelofs made changes -
          Attachment yarn-state-machine.job.dot [ 12476505 ]
          Attachment yarn-state-machine.task-attempt.dot [ 12476506 ]
          Attachment yarn-state-machine.task.dot [ 12476507 ]
          Attachment yarn-state-machine.job.png [ 12476508 ]
          Attachment yarn-state-machine.task-attempt.png [ 12476509 ]
          Attachment yarn-state-machine.task.png [ 12476510 ]
          Luke Lu made changes -
          Attachment capacity-scheduler-dark-theme.png [ 12474225 ]
          Attachment multi-column-stable-sort-default-theme.png [ 12474226 ]
          Arun C Murthy made changes -
          Attachment MR-279.patch [ 12473927 ]
          Arun C Murthy made changes -
          Attachment MR-279.patch [ 12473869 ]
          Attachment MR-279.sh [ 12473870 ]
          Attachment MR-279_MR_files_to_move.txt [ 12473871 ]
          Owen O'Malley made changes -
          Comment [ h5. Proposal

          The fundamental idea of the re-factor is to divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components: a generic resource scheduler and a per-job, user-defined component that manages the application execution.

          The new ResourceManager manages the global assignment of compute resources to applications, and the per-application ApplicationMaster manages the application's scheduling and coordination. An application is either a single job in the classic Map-Reduce sense or a DAG of such jobs. The ResourceManager and the per-machine NodeManager server, which manages the user processes on that machine, form the computation fabric. The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

          The ResourceManager is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees on restarting failed tasks either due to application failure or hardware failures.

          The ResourceManager performs its scheduling function based on the resource requirements of the applications; each application has multiple resource-request types that represent the resources required for its containers. The resource requests include memory, CPU, disk, network etc. Note that this is a significant change from the current model of fixed-type slots in Hadoop MapReduce, which has a significant negative impact on cluster utilization. The ResourceManager has a scheduler policy plug-in, which is responsible for partitioning the cluster resources among various queues, applications etc. Scheduler plug-ins can be based, e.g., on the current CapacityScheduler and FairScheduler.

          The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Scheduler.

          The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, launching tasks, tracking their status & monitoring for progress, handling task failures and recovering from saved state on a ResourceManager fail-over.

          Since downtime is more expensive at scale, high availability is built in from the beginning via Apache ZooKeeper for the ResourceManager and HDFS checkpoints for the MapReduce ApplicationMaster. Security and multi-tenancy support are critical to supporting many users on the larger clusters. The new architecture will also increase innovation and agility by allowing for user-defined versions of the MapReduce runtime. Support for generic resource requests will increase cluster utilization by removing artificial bottlenecks such as the hard partitioning of resources into map and reduce slots.

          ----

          We have a *prototype* we'd like to commit to a branch soon, where we look forward to feedback. From there on, we would love to collaborate to get it committed to trunk.

          ]
          Arun C Murthy made changes -
          Description We, at Yahoo!, have been using Hadoop-On-Demand as the resource provisioning/scheduling mechanism.

          With HoD, the user uses a self-service system to ask for a set of nodes. HoD allocates these from a global pool and also provisions a private Map-Reduce cluster for the user. She then runs her jobs and shuts the cluster down via HoD when done. All user-private clusters use the same humongous, static HDFS (e.g. a 2k-node HDFS).

          More details about HoD are available here: HADOOP-1301.

          ----

          h3. Motivation

          The current deployment (Hadoop + HoD) has a couple of implications:

           * _Non-optimal Cluster Utilization_

             1. Job-private Map-Reduce clusters imply that the user-cluster could potentially be *idle* for at least a while before being detected and shut down.

             2. Elastic Jobs: Map-Reduce jobs typically have lots of maps with a much smaller number of reduces, with maps being light and quick and reduces being I/O-heavy and longer-running. Users typically allocate clusters depending on the number of maps (i.e. input size), which leads to the scenario where all the maps are done (idle nodes in the cluster) while the few reduces are chugging along. Right now, we do not have the ability to shrink the HoD'ed Map-Reduce clusters, which would alleviate this issue.

           * _Impact on data-locality_

          With the current setup of a static, large HDFS and much smaller (5/10/20/50 node) clusters, there is a good chance of losing one of Map-Reduce's primary features: the ability to execute tasks on the datanodes where the input splits are located. In fact, we have seen data-local tasks go down to 20-25 percent in the GridMix benchmarks, from the 95-98 percent we see on the randomwriter+sort runs run as part of the hadoopqa benchmarks (admittedly a synthetic benchmark, but still). Admittedly, HADOOP-1985 (rack-aware Map-Reduce) helps significantly here.

          ----

          Primarily, the notion of *job-level scheduling* leading to private clusters, as opposed to *task-level scheduling*, is a good peg on which to hang the majority of the blame.

          Keeping the above factors in mind, here are some thoughts on how to re-structure Hadoop Map-Reduce to solve some of these issues.

          ----

          h3. State of the Art

          As it exists today, a large, static Hadoop Map-Reduce cluster (forget HoD for a bit) does provide task-level scheduling; however, its scalability to tens of thousands of user jobs per week is in question.

          Let's review its current architecture and main components:

           * JobTracker: It does both *task-scheduling* and *task-monitoring* (tasktrackers send task statuses via periodic heartbeats), which implies it is fairly loaded. It is also a _single point of failure_ in the Map-Reduce framework, i.e. its failure implies that all the jobs in the system fail. This means a static, large Map-Reduce cluster is fairly susceptible and a definite suspect. Clearly HoD solves this by having per-job clusters, albeit with the above drawbacks.
           * TaskTracker: The slave in the system, which executes one task at a time under directions from the JobTracker.
           * JobClient: The per-job client which just submits the job and polls the JobTracker for status.

          ----

          h3. Proposal - Map-Reduce 2.0

          The primary idea is to move to task-level scheduling and static Map-Reduce clusters (so as to maintain the same storage cluster and compute cluster paradigm) as a way to directly tackle the two main issues illustrated above. Clearly, we will have to get around the existing problems, especially w.r.t. scalability and reliability.

          The proposal is to re-work Hadoop Map-Reduce to make it suitable for a large, static cluster.

          Here is an overview of how its main components would look like:
           * JobTracker: Turn the JobTracker into a pure task-scheduler, a global one. Let's call this the *JobScheduler* henceforth. Clearly, (data-locality-aware) Maui/Moab are candidates for being the scheduler, in which case the JobScheduler is just a thin wrapper around them.
           * TaskTracker: These stay as before, with some minor changes as illustrated later in the piece.
           * JobClient: Fatten up the JobClient by putting a lot more intelligence into it. Enhance it to talk to the JobScheduler to ask for available TaskTrackers and then contact them to schedule and monitor the tasks. So we'll have lots of per-job clients talking to the JobScheduler and the relevant TaskTrackers for their respective jobs, a big change from today. Let's call this the *JobManager* henceforth.

          A broad sketch of how things would work:

          h4. Deployment

          There is a single, static, large Map-Reduce cluster, and no per-job clusters.

          Essentially there is one global JobScheduler with thousands of independent TaskTrackers, each running on one node.

          As mentioned previously, the JobScheduler is a pure task-scheduler. When contacted by per-job JobManagers querying for TaskTrackers to run their tasks on, the JobScheduler takes into account the job priority, data placements (HDFS blocks) and the current load/capacity of the TaskTrackers, and gives the JobManager a free slot for the task(s) in question, if available.

          Each TaskTracker periodically updates the master JobScheduler with information about the currently running tasks and available free slots. It waits for the per-job JobManager to contact it for free slots (which abide by the JobScheduler's directives) and for status of currently running tasks (of course, the JobManager knows exactly which TaskTrackers it needs to talk to).

          The fact that the JobScheduler is no longer doing the heavy lifting of monitoring tasks (like the current JobTracker does), and hence the jobs, is the key differentiator, which is why it should be very light-weight. (Thus, it is even conceivable to imagine a hot backup of the JobScheduler, a topic for another discussion.)
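          The heartbeat/slot-granting division of labor can be sketched as a toy model (hypothetical names, not actual Hadoop classes): TaskTrackers heartbeat their free-slot counts to a light-weight JobScheduler, which only hands out slots and does no monitoring afterwards.

```python
# Hypothetical sketch of the proposed split: TaskTrackers heartbeat
# free-slot counts to a pure JobScheduler; per-job JobManagers query it
# for slots. The JobScheduler never monitors the tasks themselves.

class JobScheduler:
    def __init__(self):
        self.free_slots = {}  # tracker name -> free slot count

    def heartbeat(self, tracker: str, free: int) -> None:
        """Periodic TaskTracker update: currently available slots."""
        self.free_slots[tracker] = free

    def grant_slots(self, wanted: int) -> list:
        """Hand out up to `wanted` slots; monitoring is left entirely
        to the per-job JobManager."""
        granted = []
        for tracker, free in self.free_slots.items():
            take = min(free, wanted - len(granted))
            self.free_slots[tracker] -= take
            granted.extend([tracker] * take)
            if len(granted) == wanted:
                break
        return granted

js = JobScheduler()
js.heartbeat("tt1", 2)
js.heartbeat("tt2", 3)
print(js.grant_slots(4))
```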

          h4. Job Execution

          Here is what the job-execution work-flow looks like:

              * User submits a job,
              * The JobClient, as today, validates inputs, computes the input splits etc.
              * Rather than submit the job to the JobTracker which then runs it, the JobClient now dons the role of the JobManager as described above (of course, they could be two independent processes working in conjunction with each other...). The JobManager pro-actively works with the JobScheduler and the TaskTrackers to execute the job: while there are more tasks to run for the still-running job, it contacts the JobScheduler to get 'n' free slots and schedules m tasks (m <= n) on the given TaskTrackers (slots). The JobManager also monitors the tasks by contacting the relevant TaskTrackers (it knows which TaskTrackers are running its tasks).
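
          The JobManager loop in the steps above can be sketched as follows. The `Scheduler` and `TaskLauncher` interfaces are hypothetical stand-ins for the real JobScheduler/TaskTracker RPCs, not actual Hadoop APIs:

```java
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the per-job JobManager loop: while tasks remain,
// ask the JobScheduler for free slots and launch up to that many tasks on
// the granted TaskTrackers. Interface names are illustrative stand-ins for
// the real RPCs.
class JobManagerSketch {
    interface Scheduler {           // stand-in for the JobScheduler RPC
        List<String> getFreeSlots(String jobId, int wanted);
    }
    interface TaskLauncher {        // stand-in for the TaskTracker RPC
        void launchTask(String tracker, String taskId);
    }

    // Drains pendingTasks by repeatedly asking for slots; returns the
    // number of tasks launched.
    static int runJob(String jobId, Queue<String> pendingTasks,
                      Scheduler scheduler, TaskLauncher launcher) {
        int launched = 0;
        while (!pendingTasks.isEmpty()) {
            List<String> slots =
                scheduler.getFreeSlots(jobId, pendingTasks.size());
            if (slots.isEmpty()) continue;  // real code would back off and retry
            for (String tracker : slots) {
                if (pendingTasks.isEmpty()) break;
                launcher.launchTask(tracker, pendingTasks.poll());
                launched++;
            }
        }
        return launched;
    }
}
```

          Monitoring is omitted here; in the proposal the same JobManager would also poll the TaskTrackers it launched tasks on for their statuses.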

          h4. Brownie Points

           * With Map-Reduce v2.0, we get the reliability/scalability missing from the current (Map-Reduce + HoD) architecture.
           * We get elastic jobs for free since there is no concept of private clusters, and clearly JobManagers do not need to hold on to map-nodes once they are done with them.
           * We get data-locality across all jobs, big or small, since there are no off-limits DataNodes (i.e. DataNodes outside a private cluster) for the Map-Reduce cluster, as there are today.
           * From an architectural standpoint, each component in the system (sans the global scheduler) is nicely independent of the others:
            ** A JobManager is responsible for one and only one job; the loss of a JobManager affects only one job.
            ** A TaskTracker manages only one node; its loss affects only one node in the cluster.
            ** No user-code runs in the JobScheduler since it's a pure scheduler.
           * We can run all of the user-code (input/output formats, split calculation, task-output promotion etc.) from the JobManager since it is, by definition, the user-client.

          h4. Points to Ponder

           * Given that the JobScheduler is very light-weight, could we have a hot-backup for HA?
           * Could we have a rack-level aggregator of TaskTracker statuses, i.e. rather than have every TaskTracker update the JobScheduler directly, a rack-level aggregator could do so on their behalf?
           * We could have the notion of the JobManager being a proxy process running inside the cluster for the JobClient (the job-submitting program, which runs outside the colo, e.g. on a user's dev box)... in fact, we can think of the JobManager as *another kind of task* which needs to be scheduled to run on a TaskTracker.
           * Task isolation via separate VMs (VMware/Xen) rather than just separate JVMs?

          h4. How do we get to Map-Reduce 2.0?

          At the risk of sounding hopelessly optimistic, we probably do not have to work too much to get here.

           * Clearly the main changes come in the JobTracker/JobClient where we _move_ the pieces which monitor the job's tasks' progress into the JobScheduler/JobManager.
           * We also need to enhance the JobClient (as the JobManager) to talk to the JobTracker (JobScheduler) to query for empty slots, which might not be available!
           * Then we need to add RPCs to get the JobClient (JobManager) to talk to the given TaskTrackers to get them to run the tasks, thus reversing the direction of the current RPCs needed to start a task (today the TaskTracker asks the JobTracker for tasks to run); we also need new RPCs for the JobClient (JobManager) to query the TaskTracker for its tasks' statuses.
           * We leave the current heartbeat mechanism from the TaskTracker to the JobTracker (JobScheduler) as-is, sans the task-statuses.
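
          To make the direction-reversal concrete, here is a toy sketch of what the new JobManager-facing TaskTracker protocol could look like; the interface, method names and the in-memory implementation are all hypothetical, not actual Hadoop APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the reversed RPCs: the JobManager now calls the
// TaskTracker to start tasks and to poll their statuses, instead of the
// TaskTracker pulling tasks from the JobTracker. Names are illustrative.
interface TaskTrackerProtocol {
    boolean launchTask(String jobId, String taskId);    // JobManager -> TT
    Map<String, String> getTaskStatuses(String jobId);  // JobManager -> TT
}

class InMemoryTaskTracker implements TaskTrackerProtocol {
    private final int slots;                 // fixed slot capacity of this node
    private final Map<String, Map<String, String>> statusByJob = new HashMap<>();
    private int running = 0;

    InMemoryTaskTracker(int slots) { this.slots = slots; }

    @Override public boolean launchTask(String jobId, String taskId) {
        if (running == slots) return false;  // no free slot: JobManager must retry
        running++;
        statusByJob.computeIfAbsent(jobId, j -> new HashMap<>())
                   .put(taskId, "RUNNING");
        return true;
    }

    @Override public Map<String, String> getTaskStatuses(String jobId) {
        return statusByJob.getOrDefault(jobId, Map.of());
    }
}
```

          The heartbeat from the TaskTracker to the JobScheduler stays as-is (minus the task-statuses); only these per-job calls change direction.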

          h4. Glossary

           * JobScheduler - The global, task-scheduler which is today's JobTracker minus the code for tracking/monitoring jobs and their tasks. A pure scheduler.
           * JobManager - The per-job manager which is wholly responsible for working with the JobScheduler and TaskTrackers to schedule its tasks and track their progress till job-completion (success/failure). Simplistically, it is the current JobClient plus the enhancements to enable it to talk to the JobScheduler and TaskTrackers for running/monitoring the tasks.

          ----

          h3. Tickets for the Gravy-Train ride

          Eric has started a discussion about generalizing Hadoop to support non-MR tasks, a discussion which has surfaced a few times on our lists, at HADOOP-2491.

          He notes:

          {quote}
          Our primary goal in going this way would be to get better utilization out of map-reduce clusters and support a richer scheduling model. The ability to support alternative job frameworks would just be gravy!

          Putting this in as a place holder. Hope to get folks talking about this to post some more detail.
          {quote}

          This is the start of the path to the promised gravy-land. *smile*

          We believe Map-Reduce 2.0 is a good start in moving most (if not all) of the Map-Reduce specific code into the user-clients (i.e. JobManager) and taking a shot at generalizing the JobTracker (as the JobScheduler) and the TaskTracker to handle more generic tasks via different (smarter/dumber) user-clients.

          ----

          Thoughts?
          Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution.
          Reporter: Arun C Murthy