Hadoop YARN / YARN-2915

Enable YARN RM scale out via federation using multiple RMs


Details

    • Reviewed
    • A federation-based approach to transparently scale a single YARN cluster to tens of thousands of nodes, by federating multiple standalone YARN clusters (sub-clusters). The applications running in this federated environment will see a single massive YARN cluster and will be able to schedule tasks on any node of the federated cluster. Under the hood, the federation system will negotiate with the sub-clusters' ResourceManagers and provide resources to the application. The goal is to allow an individual job to “span” sub-clusters seamlessly.
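
To make the idea of a job spanning sub-clusters concrete, here is a toy sketch in Python. It is not the YARN implementation: the `SubCluster` class, the `allocate_across` function, and the greedy home-first strategy are all assumptions made for illustration. It only shows how one container request might be satisfied from several sub-clusters while the application sees a single allocation.

```python
from dataclasses import dataclass


@dataclass
class SubCluster:
    """Hypothetical stand-in for one sub-cluster and its free capacity."""
    name: str
    free_containers: int


def allocate_across(sub_clusters, home, requested):
    """Greedily satisfy `requested` containers, home sub-cluster first.

    Returns a dict mapping sub-cluster name to containers granted.
    Illustrative only; this is an assumed strategy, not YARN's.
    """
    # Stable sort: the home sub-cluster moves to the front, the rest
    # keep the order in which they were passed in.
    ordered = sorted(sub_clusters, key=lambda sc: sc.name != home)
    grants = {}
    remaining = requested
    for sc in ordered:
        if remaining == 0:
            break
        granted = min(sc.free_containers, remaining)
        if granted > 0:
            sc.free_containers -= granted
            grants[sc.name] = granted
            remaining -= granted
    return grants


clusters = [SubCluster("sc1", 3), SubCluster("sc2", 5), SubCluster("sc3", 4)]
grants = allocate_across(clusters, home="sc2", requested=7)
print(grants)  # {'sc2': 5, 'sc1': 2}
```

The request for seven containers does not fit in the home sub-cluster alone, so the remainder is taken from another sub-cluster, which is the "span" behavior the release note describes.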

    Description

      This is an umbrella JIRA that proposes to scale out YARN to support large clusters comprising tens of thousands of nodes. That is, rather than limiting a YARN-managed cluster to about 4k nodes, the proposal is to enable the YARN-managed cluster to be elastically scalable.

      Attachments

        1. YARN-Federation-Hadoop-Summit_final.pptx
          182 kB
          Subramaniam Krishnan
        2. Yarn_federation_design_v1.pdf
          787 kB
          Subramaniam Krishnan
        3. federation-prototype.patch
          729 kB
          Subramaniam Krishnan
        4. Federation-BoF.pdf
          909 kB
          Subramaniam Krishnan
        5. FEDERATION_CAPACITY_ALLOCATION_JIRA.pdf
          751 kB
          Carlo Curino

        Issue Links

          1. Federation Membership State Store internal APIs | Sub-task | Resolved | Subramaniam Krishnan
          2. Federation State and Policy Store (DBMS implementation) | Sub-task | Resolved | Giovanni Matteo Fumarola
          3. Federation PolicyStore internal APIs | Sub-task | Resolved | Subramaniam Krishnan
          4. Federation subcluster membership mechanisms | Sub-task | Resolved | Subramaniam Krishnan
          5. Federation Intercepting and propagating AM-home RM communications | Sub-task | Resolved | Botong Huang
          6. Federation: transparently spanning application across multiple sub-clusters | Sub-task | Resolved | Botong Huang
          7. Integrate Federation services with ResourceManager | Sub-task | Resolved | Subramaniam Krishnan
          8. Create Facade for Federation State and Policy Store | Sub-task | Resolved | Subramaniam Krishnan
          9. Create a FailoverProxy for Federation services | Sub-task | Resolved | Subramaniam Krishnan
          10. Add a flag in container to indicate whether it's an AM container or not | Sub-task | Resolved | Giovanni Matteo Fumarola
          11. Make the NodeManager's ContainerManager pluggable | Sub-task | Resolved | Subramaniam Krishnan
          12. Exclude generated federation protobuf sources from YARN Javadoc/findbugs build | Sub-task | Resolved | Subramaniam Krishnan
          13. Federation Application State Store internal APIs | Sub-task | Resolved | Subramaniam Krishnan
          14. Policies APIs (for Router and AMRMProxy policies) | Sub-task | Resolved | Carlo Curino
          15. Stateless Federation router policies implementation | Sub-task | Resolved | Carlo Curino
          16. Stateless AMRMProxy policies implementation | Sub-task | Resolved | Carlo Curino
          17. PolicyManager to tie together Router/AMRM Federation policies | Sub-task | Resolved | Carlo Curino
          18. Simplify initialization/use of RouterPolicy via a RouterPolicyFacade | Sub-task | Resolved | Carlo Curino
          19. Federation Subcluster Resolver | Sub-task | Resolved | Ellen Hui
          20. In-memory based implementation of the FederationMembershipStateStore | Sub-task | Resolved | Ellen Hui
          21. In-memory based implementation of the FederationApplicationStateStore, FederationPolicyStateStore | Sub-task | Resolved | Ellen Hui
          22. Compose Federation membership/application/policy APIs into an uber FederationStateStore API | Sub-task | Resolved | Ellen Hui
          23. Bootstrap Router server module | Sub-task | Resolved | Giovanni Matteo Fumarola
          24. Federation: routing client invocations transparently to multiple RMs | Sub-task | Resolved | Giovanni Matteo Fumarola
          25. Create a proxy chain for ApplicationClientProtocol in the Router | Sub-task | Resolved | Giovanni Matteo Fumarola
          26. Create a proxy chain for ResourceManager REST API in the Router | Sub-task | Resolved | Giovanni Matteo Fumarola
          27. Create a proxy chain for ResourceManager Admin API in the Router | Sub-task | Resolved | Giovanni Matteo Fumarola
          28. InputValidator for the FederationStateStore internal APIs | Sub-task | Resolved | Giovanni Matteo Fumarola
          29. Add SubClusterId in AddApplicationHomeSubClusterResponse for Router Failover | Sub-task | Resolved | Ellen Hui
          30. UnmanagedAM pool manager for federating application across clusters | Sub-task | Resolved | Botong Huang
          31. Make the RM epoch base value configurable | Sub-task | Resolved | Subramaniam Krishnan
          32. Utils for Federation State and Policy Store | Sub-task | Resolved | Giovanni Matteo Fumarola
          33. Return SubClusterId in FederationStateStoreFacade#addApplicationHomeSubCluster for Router Failover | Sub-task | Resolved | Giovanni Matteo Fumarola
          34. Add a HashBasedRouterPolicy, and small policies and test refactoring. | Sub-task | Resolved | Carlo Curino
          35. Refactor TestPBImplRecords so that we can reuse for testing protocol records in other YARN modules | Sub-task | Resolved | Subramaniam Krishnan
          36. Add AlwayReject policies for router and amrmproxy. | Sub-task | Resolved | Carlo Curino
          37. Update the RM webapp host that is reported as part of Federation membership to current primary RM's IP | Sub-task | Resolved | Subramaniam Krishnan
          38. Add support for work preserving NM restart when AMRMProxy is enabled | Sub-task | Resolved | Botong Huang
          39. Validation and synchronization fixes in LocalityMulticastAMRMProxyPolicy | Sub-task | Resolved | Botong Huang
          40. Occasional test failure in TestWeightedRandomRouterPolicy | Sub-task | Resolved | Carlo Curino
          41. Fix minor bugs in handling of local AMRMToken in AMRMProxy | Sub-task | Resolved | Botong Huang
          42. Support multiple attempts on the node when AMRMProxy is enabled | Sub-task | Resolved | Giovanni Matteo Fumarola
          43. Share a single instance of SubClusterResolver instead of instantiating one per AM | Sub-task | Resolved | Botong Huang
          44. Cleanup when AMRMProxy fails to initialize a new interceptor chain | Sub-task | Resolved | Botong Huang
          45. Recreate interceptor chain for different attemptId in the same node in AMRMProxy | Sub-task | Resolved | Botong Huang
          46. [Documentation] Documenting the YARN Federation feature | Sub-task | Resolved | Carlo Curino
          47. [Regression] TestFederationRMStateStoreService is failing with null pointer exception | Sub-task | Resolved | Subramaniam Krishnan
          48. Fix memory leak and finish app trigger in AMRMProxy | Sub-task | Resolved | Botong Huang
          49. Refactor of ResourceManager#startWebApp in a Util class | Sub-task | Resolved | Giovanni Matteo Fumarola
          50. Fix unit test failure in TestRouterClientRMService | Sub-task | Resolved | Botong Huang
          51. Add ability to blacklist sub-clusters when invoking Routing policies | Sub-task | Resolved | Giovanni Matteo Fumarola
          52. Adding required missing configs to Federation configuration guide based on e2e testing | Sub-task | Resolved | Tanuj Nayak
          53. [Bug] FederationStateStoreFacade return behavior should be consistent irrespective of whether caching is enabled or not | Sub-task | Resolved | Subramaniam Krishnan
          54. Move FederationStateStore SQL DDL files from test resource to sbin | Sub-task | Resolved | Subramaniam Krishnan
          55. Minor clean-up and fixes in anticipation of YARN-2915 merge with trunk | Sub-task | Resolved | Botong Huang
          56. Refactoring RMWebServices by moving some util methods to RMWebAppUtil | Sub-task | Resolved | Giovanni Matteo Fumarola
          57. Add MySQL Scripts for FederationStateStore | Sub-task | Resolved | Giovanni Matteo Fumarola
          58. Update Microsoft JDBC Driver for SQL Server version in License.txt | Sub-task | Resolved | Botong Huang
          59. Handle concurrent register AM requests in FederationInterceptor | Sub-task | Resolved | Botong Huang
          60. Add PoolInitializationException as retriable exception in FederationFacade | Sub-task | Resolved | Giovanni Matteo Fumarola
          61. Federation: routing REST invocations transparently to multiple RMs (part 1 - basic execution) | Sub-task | Resolved | Giovanni Matteo Fumarola
          62. ZooKeeper based implementation of the FederationStateStore | Sub-task | Resolved | Íñigo Goiri
          63. Metrics for Federation StateStore | Sub-task | Resolved | Ellen Hui
          64. Metrics for Federation Router | Sub-task | Resolved | Giovanni Matteo Fumarola
          65. Federation: routing REST invocations transparently to multiple RMs (part 2 - getApps) | Sub-task | Resolved | Giovanni Matteo Fumarola
          66. Federation: routing getNode/getNodes/getMetrics REST invocations transparently to multiple RMs | Sub-task | Resolved | Giovanni Matteo Fumarola
          67. Update YARN daemon startup/shutdown scripts to include Router service | Sub-task | Resolved | Giovanni Matteo Fumarola
          68. Federation: routing ClientRM invocations transparently to multiple RMs (part 2 - getApps) | Sub-task | Resolved | Giovanni Matteo Fumarola
          69. Federation: routing ClientRM invocations transparently to multiple RMs (part 5 - getNode/getNodes/getMetrics) | Sub-task | Resolved | Giovanni Matteo Fumarola
          70. Add support for updateContainers when allocating using FederationInterceptor | Sub-task | Resolved | Botong Huang
          71. Basic Federation UI | Sub-task | Resolved | Íñigo Goiri
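
Several sub-tasks in the list above concern router policies, including a HashBasedRouterPolicy and the ability to blacklist sub-clusters when invoking routing policies. The Python sketch below illustrates the general idea only and is not taken from the Hadoop code base: the function name `pick_home_sub_cluster`, its parameters, and the choice of hashing the queue name with SHA-256 are all assumptions.

```python
import hashlib


def pick_home_sub_cluster(queue, active, blacklist=()):
    """Choose a home sub-cluster for an application by hashing its queue name.

    `active` is a list of sub-cluster ids assumed to be alive; `blacklist`
    holds ids to skip, for example after a failed submission attempt.
    Hypothetical helper for illustration, not a Hadoop API.
    """
    candidates = sorted(sc for sc in active if sc not in blacklist)
    if not candidates:
        raise RuntimeError("no active sub-cluster available")
    # A stable digest (unlike Python's built-in hash()) keeps the routing
    # decision deterministic across processes and restarts.
    digest = hashlib.sha256(queue.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


active = ["sc1", "sc2", "sc3"]
first = pick_home_sub_cluster("default", active)
# On failure, retry while excluding the sub-cluster that was just tried.
retry = pick_home_sub_cluster("default", active, blacklist=[first])
```

Under these assumptions, the same queue always maps to the same sub-cluster while membership is unchanged, and a retry with the first choice blacklisted is guaranteed to land elsewhere.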

          Activity

            People

              subru Subramaniam Krishnan
              sriramsrao Sriram Rao
              Votes: 2
              Watchers: 87
