[YARN-8849] DynoYARN: A simulation and testing infrastructure for YARN clusters - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Traditionally, YARN workload simulation is performed using SLS (Scheduler Load Simulator), which is packaged with YARN. Essentially, it starts a full-fledged ResourceManager but runs simulators for the NodeManager and the ApplicationMaster containers. These simulators are lightweight and run in a thread pool. The NM simulators do not open any external ports and send in-process heartbeats to the ResourceManager.
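
As a rough illustration of the thread-pool model described above (not the actual SLS code), the sketch below runs simulated NMs as tasks on a shared pool and has them call the RM in-process. The names HeartbeatCountingRM and run_simulated_nms, and the pool size, are hypothetical and chosen only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


class HeartbeatCountingRM:
    """Hypothetical stand-in for the RM: counts heartbeats per node."""

    def __init__(self):
        self._lock = threading.Lock()
        self.heartbeats = {}

    def node_heartbeat(self, node_id):
        # In-process call: no RPC and no external port is involved.
        with self._lock:
            self.heartbeats[node_id] = self.heartbeats.get(node_id, 0) + 1


def run_simulated_nms(rm, num_nodes, rounds, pool_size=4):
    """Run `rounds` heartbeat rounds for `num_nodes` simulated NMs on one pool."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(rounds):
            futures = [
                pool.submit(rm.node_heartbeat, f"nm-{i}") for i in range(num_nodes)
            ]
            for future in futures:
                future.result()  # wait for the round to finish


rm = HeartbeatCountingRM()
run_simulated_nms(rm, num_nodes=100, rounds=3)
print(len(rm.heartbeats), sum(rm.heartbeats.values()))  # prints "100 300"
```

Every simulated node shares the same few worker threads and the same process as the RM, which is what keeps the simulators lightweight and also what limits how far they scale.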

There are a couple of drawbacks with using the SLS:

It can be difficult to simulate really large clusters without access to a very powerful machine, since the NMs are launched as tasks in a thread pool and each NM has to send periodic heartbeats to the RM.
Certain features (like YARN-1011) require changes to the NodeManager: aspects such as queuing and selectively killing containers have to be incorporated into the existing NM simulator, which might make the simulator heavyweight, since locking and synchronization are needed.
Since the NM and AM are simulations, only the Scheduler is faithfully tested; it is not really an end-to-end test of a cluster.
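
To make the first drawback concrete, the following back-of-the-envelope sketch estimates how many heartbeats a single SLS process has to generate and handle per second. The function name and the node counts and intervals used below are illustrative assumptions, not measurements from any cluster.

```python
def heartbeats_per_second(num_nodes, heartbeat_interval_ms):
    """Aggregate heartbeat rate if every NM heartbeats once per interval."""
    return num_nodes * 1000.0 / heartbeat_interval_ms


# Hypothetical cluster sizes, for illustration only.
print(heartbeats_per_second(10000, 1000))  # prints "10000.0"
print(heartbeats_per_second(4000, 2000))   # prints "2000.0"
```

The rate grows linearly with the node count, and in SLS both the sending and the receiving side of every heartbeat run inside the same process.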

Therefore, drawing inspiration from Dynamometer, we propose DynoYARN, a framework for a YARN-deployable YARN cluster for testing, with the following features:

The NM already has hooks to plug in a custom ContainerExecutor and NodeResourceMonitor. If we can also plug in a custom monitoring thread for ContainersMonitorImpl (and other modules like the LocalizationService), we can probably inject an Executor that does not actually launch containers, and a Node and Container resource monitor that reports synthetic, pre-specified utilization metrics back to the RM.
Since we are launching fake containers, we cannot run normal AM containers. We can therefore use Unmanaged AMs to launch synthetic jobs.
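
The idea of an executor that launches nothing, paired with a monitor that reports pre-specified numbers, can be sketched in a few lines. This is a conceptual model only: it does not use the real Hadoop ContainerExecutor or NodeResourceMonitor interfaces, and the class names and the SYNTH_CPU_VCORES / SYNTH_MEMORY_MB keys are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Utilization:
    cpu_vcores: float
    memory_mb: int


@dataclass
class FakeContainerExecutor:
    """Records containers as running without starting any process."""

    running: dict = field(default_factory=dict)

    def launch(self, container_id, launch_env):
        # Read the pre-specified utilization from the (hypothetical) launch env.
        util = Utilization(
            cpu_vcores=float(launch_env.get("SYNTH_CPU_VCORES", "0")),
            memory_mb=int(launch_env.get("SYNTH_MEMORY_MB", "0")),
        )
        self.running[container_id] = util
        return util

    def kill(self, container_id):
        self.running.pop(container_id, None)


class SyntheticResourceMonitor:
    """Reports the pre-specified numbers instead of measuring real usage."""

    def __init__(self, executor):
        self.executor = executor

    def container_utilization(self, container_id):
        return self.executor.running[container_id]

    def node_utilization(self):
        utils = list(self.executor.running.values())
        return Utilization(
            cpu_vcores=sum(u.cpu_vcores for u in utils),
            memory_mb=sum(u.memory_mb for u in utils),
        )


executor = FakeContainerExecutor()
monitor = SyntheticResourceMonitor(executor)
executor.launch("c1", {"SYNTH_CPU_VCORES": "0.5", "SYNTH_MEMORY_MB": "1024"})
executor.launch("c2", {"SYNTH_CPU_VCORES": "1.5", "SYNTH_MEMORY_MB": "2048"})
print(monitor.node_utilization())
```

Because nothing is actually executed, the cost of a "running" container is one dictionary entry, while the rest of the node's bookkeeping can stay on its normal code path.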

Essentially, a test workflow would look like this:

Launch a DynoYARN cluster.
Use the Unmanaged AM feature to directly negotiate with the DynoYARN Resource Manager for container tokens.
Use the container tokens from the RM to directly ask the DynoYARN Node Managers to start fake containers.
The DynoYARN NodeManagers will start the fake containers and report to the DynoYARN Resource Manager synthetically generated resource utilization for the containers (which will be injected via the ContainerLaunchContext and parsed by the plugged-in Container Executor).
The Scheduler will use the utilization report to schedule containers - we will be able to test allocation of Opportunistic containers based on resource utilization.
Since the DynoYARN Node Managers run the actual code paths, all preemption and queuing logic will be faithfully executed.
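
The workflow above can be walked through with a toy model. This is a sketch under simplifying assumptions, not DynoYARN or Hadoop code: ToyResourceManager, ToyNodeManager and the SYNTH_MEMORY_MB key are hypothetical, a single node is modeled, and the token check is a plain lookup rather than the signed tokens a real cluster uses.

```python
import secrets


class ToyResourceManager:
    def __init__(self, node_capacity_mb):
        self.node_capacity_mb = node_capacity_mb
        self.issued_tokens = {}  # token -> container id
        self.reported_used_mb = 0
        self._next_id = 0

    def allocate(self):
        """Hand a container id and token to the unmanaged AM."""
        self._next_id += 1
        container_id = f"container_{self._next_id}"
        token = secrets.token_hex(8)
        self.issued_tokens[token] = container_id
        return container_id, token

    def node_heartbeat(self, used_mb):
        """Record the synthetic utilization reported by the node."""
        self.reported_used_mb = used_mb

    def can_place_opportunistic(self, request_mb):
        """Decide from reported utilization, not from allocated resources."""
        return self.reported_used_mb + request_mb <= self.node_capacity_mb


class ToyNodeManager:
    def __init__(self, rm):
        self.rm = rm
        self.fake_containers = {}  # container id -> synthetic memory use (MB)

    def start_container(self, container_id, token, launch_env):
        # Simplification: validate the token by looking it up at the RM.
        if self.rm.issued_tokens.get(token) != container_id:
            raise PermissionError("invalid container token")
        self.fake_containers[container_id] = int(launch_env["SYNTH_MEMORY_MB"])

    def heartbeat(self):
        self.rm.node_heartbeat(sum(self.fake_containers.values()))


rm = ToyResourceManager(node_capacity_mb=8192)
nm = ToyNodeManager(rm)
container_id, token = rm.allocate()                     # AM negotiates with the RM
nm.start_container(container_id, token, {"SYNTH_MEMORY_MB": "6144"})
nm.heartbeat()                                          # NM reports utilization
print(rm.can_place_opportunistic(1024))  # prints "True"
print(rm.can_place_opportunistic(4096))  # prints "False"
```

The scheduler's answer depends only on the numbers injected at container start, so a test can drive it into any utilization scenario without running real workloads.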

Issue Links

relates to

YARN-1011 [Umbrella] Schedule containers based on utilization of currently allocated containers

Open

links to

DynoYARN github repo

Sub-Tasks

Make certain aspects of the NM pluggable to support a DynoYARN cluster

Open

Unassigned

People

Assignee: Jonathan Hung

Reporter: Arun Suresh

Votes: 0

Watchers: 33

Dates

Created: 05/Oct/18 04:00

Updated: 20/Sep/21 17:47

Resolved: 20/Sep/21 17:47