Task Throttling Per Participant
This limitation is conceptually equivalent to the max thread pool size of the TaskStateModelFactory on the participant.
If a user constructs the TaskStateModelFactory with a customized executor that has a limited-size thread pool, that participant will never execute more tasks than the threshold.
The problem is that the controller does not know about this limitation, so tasks will still be assigned to the participant; they will be queued in the participant's thread pool and never re-assigned.
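For illustration, here is a minimal sketch of how such a limited pool arises on the participant side, assuming the TaskStateModelFactory constructor overload that accepts a ScheduledExecutorService (the pool size of 10 is arbitrary):

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

import org.apache.helix.HelixManager;
import org.apache.helix.task.TaskFactory;
import org.apache.helix.task.TaskStateModelFactory;

public class ParticipantSetup {
  // Register the task state model with a pool of only 10 threads: the
  // participant can run at most 10 tasks concurrently, but the controller
  // has no way of knowing that and may still assign more tasks to it.
  public static TaskStateModelFactory createFactory(HelixManager manager,
      Map<String, TaskFactory> taskFactoryRegistry) {
    ScheduledExecutorService taskExecutor = Executors.newScheduledThreadPool(10);
    return new TaskStateModelFactory(manager, taskFactoryRegistry, taskExecutor);
  }
}
```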
It makes more sense to throttle tasks in the controller, at the moment tasks are assigned to participants.
Basically, a participant is configured with a "MaxRunningTasksNumber", and the controller assigns tasks accordingly.
When calculating the best possible state in the JobRebalancer:
```
foreach Job in RunnableJobs:
    TaskToParticipantMapping = CalculateAssignment(Job)
    foreach MappingEntry in TaskToParticipantMapping:
        if Running_task + ToBeAssigned_task exceeds Participant_Task_Threshold:
            [Stretch] try the next applicable participant (consider tasks attached to the resource)
```
The above logic can be considered a task queuing algorithm. However, the original assignment still relies on the current logic, so if all participants have enough capacity, tasks will still be evenly dispatched.
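To make the check concrete, here is a minimal Java sketch of the per-participant threshold test above; all names are hypothetical and this is not actual JobRebalancer code:

```java
import java.util.Map;

public class TaskThrottleCheck {
  // Tasks already running (or pending) per participant vs. each
  // participant's configured MaxRunningTasksNumber.
  public static boolean canAssign(String participant,
      Map<String, Integer> runningTaskCounts, Map<String, Integer> thresholds) {
    int running = runningTaskCounts.getOrDefault(participant, 0);
    int threshold = thresholds.getOrDefault(participant, 40); // assumed default
    // Assign only if one more task still fits within the threshold.
    return running + 1 <= threshold;
  }
}
```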
"id" : "localhost_12918",
"HELIX_ENABLED" : "true",
"HELIX_ENABLED_TIMESTAMP" : "1493326930182",
"HELIX_HOST" : "localhost",
"HELIX_PORT" : "12918",
"MAX_RUNNING_TASK" : "55"
For old participants that do not set this field, the controller assumes a default thread pool capacity of 40 (equal to the default message-handling thread pool size).
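A sketch of how the controller could read this field, using the generic ZNRecord simple-field accessor; the field name comes from the example above, and the helper itself is hypothetical:

```java
import org.apache.helix.model.InstanceConfig;

public class ParticipantCapacity {
  // Default equals the default message-handling thread pool size.
  private static final int DEFAULT_MAX_RUNNING_TASK = 40;

  // Read the per-participant threshold from its InstanceConfig; fall back
  // to the default for old participants that do not set the field.
  public static int getMaxRunningTasks(InstanceConfig config) {
    String value = config.getRecord().getSimpleField("MAX_RUNNING_TASK");
    return value == null ? DEFAULT_MAX_RUNNING_TASK : Integer.parseInt(value);
  }
}
```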
Note that if some tasks have a much heavier workload than others, controlling only the number of tasks won't work.
In this design, we assume that tasks have approximately the same workload.
The existing JobRebalancer is triggered every time a state change event happens. That means completely sorting all pending jobs/tasks and recalculating the assignment on every event.
A better strategy is to maintain a job priority queue in the controller:
- When a job becomes runnable, enqueue it.
- When a job completes, dequeue it.
- On any task state update, check participant capacity and assign tasks from the queue if possible.
This refactoring is considered a stretch goal.
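A minimal sketch of such a queue, assuming age-based ordering (oldest job first); PendingJob and the event hooks are hypothetical:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PendingJobQueue {
  // Hypothetical job handle; a real job would carry more state.
  static class PendingJob {
    final String name;
    final long scheduledTime; // used as the "age"-based priority

    PendingJob(String name, long scheduledTime) {
      this.name = name;
      this.scheduledTime = scheduledTime;
    }
  }

  // Oldest job first.
  private final PriorityQueue<PendingJob> queue =
      new PriorityQueue<>(Comparator.comparingLong((PendingJob j) -> j.scheduledTime));

  // When a job becomes runnable, enqueue it.
  void onJobRunnable(PendingJob job) {
    queue.offer(job);
  }

  // When a job completes, dequeue it.
  void onJobComplete(PendingJob job) {
    queue.remove(job);
  }

  // On a task state update, pick the head job only if the participant
  // still has spare capacity; otherwise leave the queue untouched.
  PendingJob pollNextIf(boolean participantHasCapacity) {
    return participantHasCapacity ? queue.poll() : null;
  }
}
```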
"GlobalMaxConcurrentJobNumber" Per Cluster
With this option, the Helix controller restricts the total number of running jobs in the cluster.
However, with this throttling, once a job is scheduled it occupies a slot until it finishes. This is bad when all the running jobs are long-running: no new jobs can be scheduled.
Moreover, it is hard for an admin to set a reasonable total job count, given that workflows and jobs usually differ considerably in their real workload.
Comparing these two options, "MaxRunningTasksPerParticipant" is directly related to a participant's capacity. Once the controller schedules tasks according to it, we can reliably avoid overloading the instances.
Even if we throttle jobs, there is no guarantee about the number of threads running on each participant.
Moreover, a user can already control job scheduling by adjusting the frequency of submitting jobs. So "GlobalMaxConcurrentJobNumber" is not necessary.
Given limited resources, which job do we schedule first?
Schedule the jobs with the highest priority first, until participants are full.
In this design, we propose the simplest solution for priority control.
The user can configure a job's priority; otherwise Helix will use the job's "age" (the time the job was scheduled) as its priority.
If some jobs are assigned a priority and others are not, Helix will assume that jobs with an explicit priority setting have higher priority.
One issue here is that if the application keeps sending high-priority jobs to Helix, lower-priority jobs may starve.
Since this is controlled by the application (and is mostly the desired result), Helix won't apply any additional control to these starving jobs.
Our plan is:
Step 1. Support job "start time" based priority.
Step 2. Support user-defined priority (a combined ordering is sketched below).
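As a sketch of the combined ordering from these two steps, the comparator below ranks explicitly prioritized jobs first, then by the configured priority, then by start time; JobInfo is a hypothetical stand-in for the job's settings:

```java
import java.util.Comparator;

public class JobPriorityUtil {
  // Hypothetical view of a job's priority-related settings.
  public static class JobInfo {
    final Long userPriority; // null if the user did not configure one
    final long startTime;    // time the job was scheduled (its "age")

    public JobInfo(Long userPriority, long startTime) {
      this.userPriority = userPriority;
      this.startTime = startTime;
    }
  }

  // Jobs with an explicit priority outrank jobs without one; among
  // prioritized jobs a higher value wins; "age" breaks remaining ties.
  public static final Comparator<JobInfo> PRIORITY_ORDER = (a, b) -> {
    if ((a.userPriority != null) != (b.userPriority != null)) {
      return a.userPriority != null ? -1 : 1; // explicit priority first
    }
    if (a.userPriority != null && !a.userPriority.equals(b.userPriority)) {
      return Long.compare(b.userPriority, a.userPriority); // higher value first
    }
    return Long.compare(a.startTime, b.startTime); // older job first
  };
}
```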
Option 1. Using per-job and per-workflow concurrency control to implement priority
WorkflowConfig.ParallelJobs and JobConfig.numConcurrentTasksPerInstance are used to control how many jobs and tasks can be executed in parallel within a single workflow.
Given that cluster administrators configure these numbers "correctly", workflows will eventually be assigned the expected resources.
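For reference, these knobs are set through the existing builder APIs; a brief sketch (the values are arbitrary):

```java
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.WorkflowConfig;

public class ConcurrencyConfigExample {
  // Allow at most 2 jobs of this workflow to run in parallel, and at most
  // 5 tasks of this job on any single participant.
  public static void configure(WorkflowConfig.Builder workflowBuilder,
      JobConfig.Builder jobBuilder) {
    workflowBuilder.setParallelJobs(2);
    jobBuilder.setNumConcurrentTasksPerInstance(5);
  }
}
```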
However, there is no promise that high-priority workflows will be scheduled before others. Because tasks are picked up randomly, the controller may end up filling the task pool entirely with items from low-priority workflows.
- Hard for users to set up the right numbers.
- Cannot strictly ensure priority.
- May lead to low utilization.
Option 2. Jobs are assigned to execution slots according to priority
The Helix controller assigns different portions of the resources (execution slots) to jobs according to their priority. For instance, we may have the following assignment, given a total capacity of 100.
High-priority jobs will therefore always get a larger portion of resources. If any job does not use all of its portion, the algorithm should be smart enough to assign the unused slots to other jobs.
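As a purely hypothetical illustration of such proportional assignment (not part of this design's implementation), the helper below splits a slot budget by priority weight and hands integer-rounding leftovers back out in map order, which is assumed to be highest priority first:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SlotAssignment {
  // priorityWeights: job name -> positive priority weight, ordered from
  // highest to lowest priority. Returns job name -> assigned slot count.
  public static Map<String, Integer> assignSlots(
      Map<String, Integer> priorityWeights, int totalSlots) {
    int totalWeight =
        priorityWeights.values().stream().mapToInt(Integer::intValue).sum();
    Map<String, Integer> slots = new LinkedHashMap<>();
    int assigned = 0;
    for (Map.Entry<String, Integer> e : priorityWeights.entrySet()) {
      int share = totalSlots * e.getValue() / totalWeight;
      slots.put(e.getKey(), share);
      assigned += share;
    }
    // Hand rounding leftovers to the highest-priority jobs first. A full
    // implementation would also redistribute portions a job does not use.
    for (String job : slots.keySet()) {
      if (assigned >= totalSlots) {
        break;
      }
      slots.put(job, slots.get(job) + 1);
      assigned++;
    }
    return slots;
  }
}
```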
The problem with this method is its complexity. In addition, since we are not assigning all available resources to the highest-priority jobs, those jobs are not guaranteed to finish quickly, and users might find this confusing.