  Hadoop YARN / YARN-2915

Enable YARN RM scale out via federation using multiple RM's

    Details

    • Hadoop Flags: Reviewed

      Description

      This is an umbrella JIRA that proposes to scale out YARN to support large clusters comprising tens of thousands of nodes. That is, rather than limiting a YARN-managed cluster to about 4k nodes in size, the proposal is to enable the YARN-managed cluster to be elastically scalable.

      1. FEDERATION_CAPACITY_ALLOCATION_JIRA.pdf
        751 kB
        Carlo Curino
      2. Federation-BoF.pdf
        909 kB
        Subru Krishnan
      3. federation-prototype.patch
        729 kB
        Subru Krishnan
      4. Yarn_federation_design_v1.pdf
        787 kB
        Subru Krishnan
      5. YARN-Federation-Hadoop-Summit_final.pptx
        182 kB
        Subru Krishnan

        Issue Links

        1. Federation Membership State Store internal APIs (Sub-task, Resolved, Subru Krishnan)
        2. Federation State and Policy Store (DBMS implementation) (Sub-task, Resolved, Giovanni Matteo Fumarola)
        3. Federation PolicyStore internal APIs (Sub-task, Resolved, Subru Krishnan)
        4. Federation subcluster membership mechanisms (Sub-task, Resolved, Subru Krishnan)
        5. Federation Intercepting and propagating AM- home RM communications (Sub-task, Resolved, Botong Huang)
        6. Federation: transparently spanning application across multiple sub-clusters (Sub-task, Resolved, Botong Huang)
        7. Integrate Federation services with ResourceManager (Sub-task, Resolved, Subru Krishnan)
        8. Create Facade for Federation State and Policy Store (Sub-task, Resolved, Subru Krishnan)
        9. Create a FailoverProxy for Federation services (Sub-task, Resolved, Subru Krishnan)
        10. Add a flag in container to indicate whether it's an AM container or not (Sub-task, Resolved, Giovanni Matteo Fumarola)
        11. Make the NodeManager's ContainerManager pluggable (Sub-task, Resolved, Subru Krishnan)
        12. Exclude generated federation protobuf sources from YARN Javadoc/findbugs build (Sub-task, Resolved, Subru Krishnan)
        13. Federation Application State Store internal APIs (Sub-task, Resolved, Subru Krishnan)
        14. Policies APIs (for Router and AMRMProxy policies) (Sub-task, Resolved, Carlo Curino)
        15. Stateless Federation router policies implementation (Sub-task, Resolved, Carlo Curino)
        16. Stateless ARMRMProxy policies implementation (Sub-task, Resolved, Carlo Curino)
        17. PolicyManager to tie together Router/AMRM Federation policies (Sub-task, Resolved, Carlo Curino)
        18. Simplify initialization/use of RouterPolicy via a RouterPolicyFacade (Sub-task, Resolved, Carlo Curino)
        19. Federation Subcluster Resolver (Sub-task, Resolved, Ellen Hui)
        20. In-memory based implementation of the FederationMembershipStateStore (Sub-task, Resolved, Ellen Hui)
        21. In-memory based implementation of the FederationApplicationStateStore, FederationPolicyStateStore (Sub-task, Resolved, Ellen Hui)
        22. Compose Federation membership/application/policy APIs into an uber FederationStateStore API (Sub-task, Resolved, Ellen Hui)
        23. Bootstrap Router server module (Sub-task, Resolved, Giovanni Matteo Fumarola)
        24. Federation: routing client invocations transparently to multiple RMs (Sub-task, Resolved, Giovanni Matteo Fumarola)
        25. Create a proxy chain for ApplicationClientProtocol in the Router (Sub-task, Resolved, Giovanni Matteo Fumarola)
        26. Create a proxy chain for ResourceManager REST API in the Router (Sub-task, Resolved, Giovanni Matteo Fumarola)
        27. Create a proxy chain for ResourceManager Admin API in the Router (Sub-task, Resolved, Giovanni Matteo Fumarola)
        28. InputValidator for the FederationStateStore internal APIs (Sub-task, Resolved, Giovanni Matteo Fumarola)
        29. Add SubClusterId in AddApplicationHomeSubClusterResponse for Router Failover (Sub-task, Resolved, Ellen Hui)
        30. UnmanagedAM pool manager for federating application across clusters (Sub-task, Resolved, Botong Huang)
        31. Make the RM epoch base value configurable (Sub-task, Resolved, Subru Krishnan)
        32. Utils for Federation State and Policy Store (Sub-task, Resolved, Giovanni Matteo Fumarola)
        33. Return SubClusterId in FederationStateStoreFacade#addApplicationHomeSubCluster for Router Failover (Sub-task, Resolved, Giovanni Matteo Fumarola)
        34. Add a HashBasedRouterPolicy, and small policies and test refactoring (Sub-task, Resolved, Carlo Curino)
        35. Refactor TestPBImplRecords so that we can reuse for testing protocol records in other YARN modules (Sub-task, Resolved, Subru Krishnan)
        36. Add AlwayReject policies for router and amrmproxy (Sub-task, Resolved, Carlo Curino)
        37. Update the RM webapp host that is reported as part of Federation membership to current primary RM's IP (Sub-task, Resolved, Subru Krishnan)
        38. Add support for work preserving NM restart when AMRMProxy is enabled (Sub-task, Resolved, Botong Huang)
        39. Validation and synchronization fixes in LocalityMulticastAMRMProxyPolicy (Sub-task, Resolved, Botong Huang)
        40. Occasional test failure in TestWeightedRandomRouterPolicy (Sub-task, Resolved, Carlo Curino)
        41. Fix minor bugs in handling of local AMRMToken in AMRMProxy (Sub-task, Resolved, Botong Huang)
        42. Support multiple attempts on the node when AMRMProxy is enabled (Sub-task, Resolved, Giovanni Matteo Fumarola)
        43. Share a single instance of SubClusterResolver instead of instantiating one per AM (Sub-task, Resolved, Botong Huang)
        44. Cleanup when AMRMProxy fails to initialize a new interceptor chain (Sub-task, Resolved, Botong Huang)
        45. Recreate interceptor chain for different attemptId in the same node in AMRMProxy (Sub-task, Resolved, Botong Huang)
        46. [Documentation] Documenting the YARN Federation feature (Sub-task, Resolved, Carlo Curino)
        47. [Regression] TestFederationRMStateStoreService is failing with null pointer exception (Sub-task, Resolved, Subru Krishnan)
        48. Fix memory leak and finish app trigger in AMRMProxy (Sub-task, Resolved, Botong Huang)
        49. Refactor of ResourceManager#startWebApp in a Util class (Sub-task, Resolved, Giovanni Matteo Fumarola)
        50. Fix unit test failure in TestRouterClientRMService (Sub-task, Resolved, Botong Huang)
        51. Add ability to blacklist sub-clusters when invoking Routing policies (Sub-task, Resolved, Giovanni Matteo Fumarola)
        52. Adding required missing configs to Federation configuration guide based on e2e testing (Sub-task, Resolved, Tanuj Nayak)
        53. [Bug] FederationStateStoreFacade return behavior should be consistent irrespective of whether caching is enabled or not (Sub-task, Resolved, Subru Krishnan)
        54. Move FederationStateStore SQL DDL files from test resource to sbin (Sub-task, Resolved, Subru Krishnan)
        55. Minor clean-up and fixes in anticipation of YARN-2915 merge with trunk (Sub-task, Resolved, Botong Huang)
        56. Refactoring RMWebServices by moving some util methods to RMWebAppUtil (Sub-task, Resolved, Giovanni Matteo Fumarola)
        57. Add MySql Scripts for FederationStateStore (Sub-task, Resolved, Giovanni Matteo Fumarola)
        58. Update Microsoft JDBC Driver for SQL Server version in License.txt (Sub-task, Resolved, Botong Huang)
        59. Handle concurrent register AM requests in FederationInterceptor (Sub-task, Resolved, Botong Huang)
        60. Add PoolInitializationException as retriable exception in FederationFacade (Sub-task, Resolved, Giovanni Matteo Fumarola)
        61. Federation: routing REST invocations transparently to multiple RMs (part 1 - basic execution) (Sub-task, Resolved, Giovanni Matteo Fumarola)
        62. ZooKeeper based implementation of the FederationStateStore (Sub-task, Resolved, Íñigo Goiri)
        63. Metrics for Federation StateStore (Sub-task, Resolved, Ellen Hui)
        64. Metrics for Federation Router (Sub-task, Resolved, Giovanni Matteo Fumarola)
        65. Federation: routing REST invocations transparently to multiple RMs (part 2 - getApps) (Sub-task, Resolved, Giovanni Matteo Fumarola)
        66. Federation: routing getNode/getNodes/getMetrics REST invocations transparently to multiple RMs (Sub-task, Resolved, Giovanni Matteo Fumarola)
        67. Update YARN daemon startup/shutdown scripts to include Router service (Sub-task, Resolved, Giovanni Matteo Fumarola)
        68. Federation: routing ClientRM invocations transparently to multiple RMs (part 2 - getApps) (Sub-task, Resolved, Giovanni Matteo Fumarola)
        69. Federation: routing ClientRM invocations transparently to multiple RMs (part 5 - getNode/getNodes/getMetrics) (Sub-task, Resolved, Giovanni Matteo Fumarola)
        70. Add support for updateContainers when allocating using FederationInterceptor (Sub-task, Resolved, Botong Huang)
         

          Activity

          Subru Krishnan added a comment -

           The proposed architecture leverages the notion of “federating” a number of standalone YARN clusters, each with its own RM and set of NMs, into a larger federated cluster. The federated cluster appears as a single giant cluster to the user, with a single public end-point (aka the Router) that conforms to the YARN application client protocol. The individual RMs participate as members of the federated cluster by heartbeating their capabilities, enabling dynamic self-configuration of the federated cluster. The applications running in this environment will have access to the entire federated YARN cluster and will be able to schedule tasks on any node of the federated cluster. We plan to extend the distributed scheduling proposed in YARN-2877, specifically YARN-2884, to allow AMs to request containers across the federated cluster.
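           For illustration, a minimal sketch of what this client-side transparency would look like (not code from the prototype; the Router host/port below are placeholders): an unmodified YarnClient simply points its client-RM address at the Router endpoint.

{code:java}
// Minimal sketch (illustrative only): an unmodified client talks to the federation
// through the Router, because the Router exposes the standard ApplicationClientProtocol.
// "router.example.com:8050" is a placeholder endpoint, not a value from the patch.
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederatedClientSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // Point the usual client-RM address at the Router instead of an individual RM.
    conf.set(YarnConfiguration.RM_ADDRESS, "router.example.com:8050");

    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();

    // Same code path as against a single RM; the Router picks the home sub-cluster.
    ApplicationId appId = client.createApplication()
        .getNewApplicationResponse().getApplicationId();
    System.out.println("New application id from the federated cluster: " + appId);

    client.stop();
  }
}
{code}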

          Subru Krishnan added a comment -

           The above proposed mechanism in combination with YARN-2884 additionally enables the private cloud scenario. If an enterprise has multiple smaller disparate YARN clusters, the federation architecture allows them to pool the individual clusters into a single large cluster that can be used by the entire organization.

          Chen He added a comment -

           Hi Subru Krishnan, this is an interesting idea. It may make task scheduling across data centers possible. Looking forward to seeing more details.

          Carlo Curino added a comment -

          A couple of design principles at play:

          1. We are designing federation so that it requires minimal changes to YARN.
          2. We are trying hard to make federation completely transparent to applications.
          3. We are investigating uses of federation that could facilitate maintenance / fault-tolerance / sub-cluster customization.

          Regarding (1), in the context of cluster pooling / private cloud idea mentioned above,
          the clusters being pooled can be (largely) unaware of the fact that are being federated together,
          as all/most public protocols are unmodified, and the AMRMProxy of YARN-2884 can be run only
          on a small cluster that work as a launch pad for the federation.

          Regarding (2), it seems plausible (we have a working prototype) to make the federation transparent to the applications,
          but more analysis of security, load balancing, and HA aspects is required.

          Regarding (3) federation should facilitate upgrades of each sub-cluster, can be made more fault-tolerant,
          by having the routing layer to fall-back on a secondary clusters upon cluster-wide failures, and could be leveraged
          for customization (e.g., run a smaller cluster with very fast heartbeats and a bigger cluster with slower heartbeat, and
          pull them together on demand).

          Subru Krishnan added a comment -

           Uploading a design proposal based on offline design discussions within our team and with Karthik Kambatla, Anubhav Dhoot, Vinod Kumar Vavilapalli, Arun C Murthy, Alejandro Abdelnur and more people (apologies if I missed anyone). We validated the proposed design by developing a prototype and we have a basic end-to-end functioning system where we can stitch multiple YARN clusters into a unified federated cluster and run jobs that transparently span across all of them.

          Lei Guo added a comment -

           Does this design also consider the multiple data center case? Though the idea is relatively generic and could be applied to either a single data center or multiple data centers, in a real production environment we have to consider network latency and data locality across data centers. The central state store could become a new bottleneck.

          Carlo Curino added a comment -

           Lei Guo that is a good question. We are not designing primarily for cross-DC scenarios, but we are of course wondering whether this approach can handle the latencies/partitioning/general drama that comes with cross-DC deployment settings. The design is somewhat tolerant to the StateStore being unavailable, but cross-DC settings might stress this beyond its comfort zone. We are planning to run tests for this in the coming weeks/months, and we will report on our findings.

          Carlo Curino added a comment -

           I created a branch, YARN-2915, which we will use for the development of this feature.

          Carlo Curino added a comment -


           ENFORCING GLOBAL INVARIANTS

           During the birds-of-a-feather session at Hadoop Summit 2015, and in separate conversations with Karthik Kambatla, Wangda Tan, Jian He, Vinod Kumar Vavilapalli, we received multiple questions on how we plan to handle global scheduler invariants with the local enforcement provided by the sub-cluster RMs.

           The attached FEDERATION_CAPACITY_ALLOCATION_JIRA.pdf is a short presentation that explains our ideas in more detail.

           The key intuition is that we will have a spectrum of options ranging from full replication of the queue structure in each sub-cluster to a full partitioning of it. On one extreme we get the best spreading of load and best fairness, while on the opposite extreme we get the best scalability and isolation among tenants. Navigating the middle ground requires dynamic algorithms that continuously re-balance the queue mappings. Conceptually the problem is very close to preemption for node-labels when we allow rich expressions and preferences on node labels.

           We propose an initial simple approach (re-using some of the preemption work to detect global imbalances), and we are considering an LP-based modeling of the problem (possibly leveraging the Apache-licensed solver in Google OR-Tools).
           The solution we propose has the potential to provide a simple, concrete initial version (which is likely to scale substantially) that we can iterate on and improve. Much of this must be driven by experimental results based on our initial prototype (which we are about to post code for).
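           For concreteness, one hypothetical way to write the re-balancing step as a linear program (the symbols below are illustrative, not taken from the attached slides): let x_{q,s} be the fraction of sub-cluster s capacity C_s mapped to queue q, g_q the global guarantee of queue q, and d_{q,s} its recently observed demand in sub-cluster s.

{noformat}
\begin{align*}
\min_{x}\quad & \textstyle\sum_{q,s} \lvert x_{q,s}\,C_s - d_{q,s}\rvert &              % follow demand as closely as possible\\
\text{s.t.}\quad & \textstyle\sum_{s} x_{q,s}\,C_s \ge g_q & \forall q   % each queue keeps its global guarantee\\
 & \textstyle\sum_{q} x_{q,s} \le 1 & \forall s                          % no sub-cluster is over-committed\\
 & x_{q,s} \ge 0 & \forall q,s
\end{align*}
{noformat}

           The absolute values can be linearized with auxiliary variables in the standard way; the resulting x re-partitions the logical queue capacities across sub-clusters.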

          Subru Krishnan added a comment -

           Lei Guo Thanks for taking a look at our design proposal and for your pertinent question. Adding to Carlo Curino's answer: while we do plan to evaluate cross-DC workloads (since the architecture is generic for the control flow, i.e. job scheduling, as you rightly observed), the data movement across DCs itself may prove prohibitive. So our fallback plan is to have a StateStore per DC, restrict jobs to within a DC, and have a simple rules-based global router that redirects user requests to the target DC.

          Subru Krishnan added a comment -

           Uploading a patch file that demonstrates the core YARN Federation end-to-end. We are able to federate multiple (8 in our case) YARN sub-clusters and run applications transparently across the federated cluster. We validated against MR, Spark and GridMix workloads with zero code changes/recompilation.

           We will be actively working in the next few weeks to clean up and break down this uber patch into smaller patches and upload them to the corresponding sub-tasks, restructuring code, adding more test cases, comments, etc.

          Lei Guo added a comment -

           Thanks Subru Krishnan and Carlo Curino, maybe we should also consider the router to be per DC.

           Another question, regarding Carlo Curino's comment, "the AMRMProxy of YARN-2884 can be run only on a small cluster that works as a launch pad for the federation": can you elaborate a little bit on the desired end-to-end behavior? The description in the design is different from the comment: "The AMRMProxy runs on all the NM machines and acts as a proxy to the YARN RM for the AMs by implementing the ApplicationMasterProtocol."

          Carlo Curino added a comment -

           Lei, first let me make sure we are on the same page regarding the router. The router is "soft-state" and a rather lightweight component, so we envision multiple routers running in each data center, and definitely agree that we will have at least one router per DC if/when we run a federation cross-DC.

           Lei, regarding the (good) question you asked about the AMRMProxy.

           The comment is derived from some early experimentation we did with the AMRMProxy from YARN-2884. The idea is that you could use the mux/demux mechanics that the AMRMProxy provides to hide multiple standalone YARN clusters (not part of a federation) behind a single AMRMProxy. The scenario goes as follows: you have a (possibly small) cluster that I will call the "launchpad", running one or more AMRMProxy(s), and say 2 standalone YARN clusters (C1, C2) that are not federation enabled. Jobs can be submitted to C1, C2 directly as always, and jobs that want to span could be submitted to the "launchpad" cluster. By customizing the policy in the AMRMProxy that determines how we forward requests to clusters, you can have an AM running on the launchpad cluster forward requests to both C1 and C2. For C1 and C2 this will look as if you submitted an unmanaged AM in each cluster. The job, on the other hand, thinks it is talking to a single RM that happens to run somewhere in the "launchpad" cluster (typically on the same node), but this is just the AMRMProxy impersonating an RM.

           To make this even more clear: we don't strictly need an AMRMProxy on each node for the story to work. However, given our current thinking/experimentation we see advantages in running the AMRMProxy on each node, such as: we avoid 2 network hops, we have a better AM-to-AMRMProxy ratio so we are more resilient to DDoS on the AMRMProtocol, there are fewer partitioning scenarios to consider, etc., so this is what we are advocating for in federation.

           In federation, we go a step further and we ask C1 and C2 to commit to sharing resources in the federation (by heartbeating to the StateStore), and we provide a lot more mechanics around it (e.g., UIs that show the overall use of resources across clusters, rebalancing mechanisms, fault-tolerance mechanics, etc.), which makes for a tighter overall experience.
           Overall, I think running the entire federation code will be better, but I was pointing out that some of the pieces we are building could be leveraged in isolation for more lightweight / ad-hoc forms of cross-cluster interaction. The rules-based global router that Subru Krishnan mentioned above falls in the same category.
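           A conceptual sketch of the mux/demux idea described above (this is not the actual AMRMProxy/FederationInterceptor code; the splitting policy and the duplication of the release list are deliberate simplifications): partition the ask list, forward each share to the corresponding RM over the standard ApplicationMasterProtocol, and merge the responses so the AM sees a single answer.

{code:java}
// Conceptual sketch only: split one AM allocate() call across two standalone RMs (C1, C2)
// and merge the answers. The routing decision in belongsToC1() is a placeholder policy.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.yarn.api.ApplicationMasterProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateRequest;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class SplittingAllocatorSketch {
  private final ApplicationMasterProtocol c1; // proxy to the RM of sub-cluster C1
  private final ApplicationMasterProtocol c2; // proxy to the RM of sub-cluster C2

  SplittingAllocatorSketch(ApplicationMasterProtocol c1, ApplicationMasterProtocol c2) {
    this.c1 = c1;
    this.c2 = c2;
  }

  // Hypothetical resolver: decide which sub-cluster serves a resource name (host/rack/ANY).
  private boolean belongsToC1(ResourceRequest ask) {
    return ask.getResourceName().hashCode() % 2 == 0; // placeholder policy
  }

  public List<Container> allocate(AllocateRequest amRequest) throws YarnException, IOException {
    List<ResourceRequest> forC1 = new ArrayList<>();
    List<ResourceRequest> forC2 = new ArrayList<>();
    for (ResourceRequest ask : amRequest.getAskList()) {
      (belongsToC1(ask) ? forC1 : forC2).add(ask);
    }
    // Simplification: releases and blacklist are sent to both RMs for illustration.
    AllocateResponse r1 = c1.allocate(AllocateRequest.newInstance(
        amRequest.getResponseId(), amRequest.getProgress(), forC1,
        amRequest.getReleaseList(), amRequest.getResourceBlacklistRequest()));
    AllocateResponse r2 = c2.allocate(AllocateRequest.newInstance(
        amRequest.getResponseId(), amRequest.getProgress(), forC2,
        amRequest.getReleaseList(), amRequest.getResourceBlacklistRequest()));
    // Merge: the AM sees a single list of containers coming from both sub-clusters.
    List<Container> allocated = new ArrayList<>(r1.getAllocatedContainers());
    allocated.addAll(r2.getAllocatedContainers());
    return allocated;
  }
}
{code}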

          Subru Krishnan added a comment -

           Attaching the slide deck from the YARN Federation BoF session.

          Vinod Kumar Vavilapalli added a comment -

          One thing that occurred to me in an offline conversation with Subru Krishnan and Wangda Tan is about the modeling of queues and their shares in different sub-clusters.

           As already proposed, it is very desirable to have unified logical queues that are applicable across all sub-clusters.

           With unified logical queues, it looks like there are some proposals for how resources can get sub-divided amongst different sub-clusters. But to me, they already map to an existing concept in YARN: Node Partitions / node-labels!

           Essentially you have one YARN cluster -> multiple sub-clusters -> each sub-cluster with multiple node-partitions. This can further be extended to more levels. For example, we can unify racks under the same concept as well.

           The advantage of unifying this with node-partitions is that we can have

           • one single administrative view philosophy of sub-clusters, node-partitions, racks, etc.
           • unified configuration mechanisms: today we support centralized and distributed node-partition mechanisms, exclusive / non-exclusive access, etc.
           • unified queue-sharing models: today we can already assign X% of a node-partition to a queue. This way we can, again, reuse existing concepts, mental models and allocation policies, instead of creating specific policies for sub-cluster sharing like the user-based share that is proposed.
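           To make the last bullet concrete, the split can be expressed with the existing CapacityScheduler node-label properties (a sketch; queue names "a", "b" and the "blue" label are made-up examples):

{code:java}
// Illustrative only: the existing CapacityScheduler node-label knobs referred to above.
import org.apache.hadoop.conf.Configuration;

public class NodeLabelQueueSketch {
  public static Configuration buildCapacitySchedulerConf() {
    Configuration conf = new Configuration(false);
    // Queue root.a may access partition "blue" and is guaranteed 30% of it.
    conf.set("yarn.scheduler.capacity.root.a.accessible-node-labels", "blue");
    conf.set("yarn.scheduler.capacity.root.a.accessible-node-labels.blue.capacity", "30");
    // Queue root.b takes the remaining 70% of the "blue" partition.
    conf.set("yarn.scheduler.capacity.root.b.accessible-node-labels", "blue");
    conf.set("yarn.scheduler.capacity.root.b.accessible-node-labels.blue.capacity", "70");
    return conf;
  }
}
{code}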

           We will have to dig deeper into the details, but it seems to me that node-partition and sub-cluster are equivalence classes, except for the fact that two sub-clusters report to two different RMs (physically / implementation-wise), which isn't the case today with node-partitions.

          Thoughts? /cc Carlo Curino Chris Douglas

          Carlo Curino added a comment - edited

           Vinod Kumar Vavilapalli, incidentally we were discussing this with Subru Krishnan just yesterday.

           Philosophically:
           I agree with you that node-labels and node-label expressions are very powerful and could subsume much of the rest of YARN locality/sub-clusters, etc.

           Another aspect that makes this equivalence somewhat pleasant is that in some reasonably restricted scenarios it is quite natural. E.g., given two node-partition labels (blue, red), at the moment the CapacityScheduler behaves almost as if the worlds of blue nodes and red nodes are completely orthogonal to each other. Mapping this onto two separate RMs dealing with blue nodes and red nodes should be rather straightforward. This is to say that if we simply "paint" each sub-cluster blue or red, it is not too hard to enforce this. Admins could use this concept to manually allocate capacity onto physical sub-clusters by manipulating labels instead of thinking of sub-clusters too explicitly.

           The two concerns with this are:

           1. labels are very generic and we risk confusing admins, as we use the same constructs to refer to very physical or very logical entities (good and bad)
           2. handling richer/more complex intersections of node-label partitions and sub-cluster notions (i.e., where they are not aligned as I described) might get trickier and require the "digging deeper" you suggested.

           All in all, I am in favor of this, especially if we also tackle more substantially the scheduler rewrite work we have discussed.

           Practically:
           I think we should land a v0 of federation with all basic mechanisms in place, but with a somewhat limited admin surface that is not fully transparent yet (i.e., we give users full transparency, but ask a little more of our admins to begin with). This allows us to harden much of the internals and mechanics before polishing all the tooling around it. Priority-wise this is very important to us.

           In v1 (soon after) we will improve this with:

           1. admin tooling that maps the single logical view of (queues + labels) to multiple sub-cluster queues + labels transparently (achieving, I think, what you ask for as an admin experience),
           2. policies that direct a job's asks based on labels + locality (providing the physical substrate to support (1)).

           Note that the general architecture makes (1) and (2) quite feasible. For example, if you look at the policies I just posted in YARN-5324, YARN-5325, it is easy (literally a handful of LOC) to modify the "routing" behavior to be based on node-labels while reusing much of the rest of the mechanics around it. In fact, if you or Wangda Tan have time/interest to work on this I am happy to help you orient yourselves in what we are doing in the policy space.
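           For example, a label-aware routing choice could be as small as the sketch below (purely illustrative; the actual policy interfaces are the ones in YARN-5324/YARN-5325 and their signatures differ):

{code:java}
// Purely illustrative: routing an application to a sub-cluster based on its node-label
// expression. The interface and the label-to-sub-cluster map are hypothetical.
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public class LabelBasedRouterSketch {
  /** Hypothetical mapping maintained elsewhere: node-label expression -> sub-cluster id. */
  private final Map<String, String> labelToSubCluster;

  public LabelBasedRouterSketch(Map<String, String> labelToSubCluster) {
    this.labelToSubCluster = labelToSubCluster;
  }

  /** Pick the home sub-cluster for a submission based on its node-label expression. */
  public String chooseHomeSubCluster(ApplicationSubmissionContext ctx, String defaultSubCluster) {
    String label = ctx.getNodeLabelExpression();
    if (label == null || label.isEmpty()) {
      return defaultSubCluster; // no label: fall back to the default routing policy
    }
    return labelToSubCluster.getOrDefault(label, defaultSubCluster);
  }
}
{code}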

          Subru Krishnan added a comment -

           Attaching the slide deck from our Hadoop Summit talk, as that's a good primer for quick reference.

          Zheng Shao added a comment -

          Thanks Subru Krishnan and Carlo Curino. This is very relevant to us at Uber since we also run multiple YARN clusters across multiple data centers.

          Andrew Wang added a comment -

          Hey folks, has this been merged to trunk yet? The vote ended over two weeks ago.

          Carlo Curino added a comment -

           Andrew Wang, yes, the branch has been merged into trunk. We are waiting to complete the UI JIRA before closing this issue.
           We will probably backport this to branch-2 as well. The rest of "phase 2" for federation is tracked in YARN-5597.


            People

            • Assignee: Subru Krishnan
            • Reporter: Sriram Rao
            • Votes: 2
            • Watchers: 79

              Dates

              • Created:
                Updated:
                Resolved:

                Development