Thanks for raising this, Haibo Chen / Karthik Kambatla. Some thoughts:
Why do we need maxOppQueueLength given queuingLimit?
So, maxOppQueueLength is more like an active limit: the CS (ContainerScheduler) will not admit any more containers into the queue than that value. The queuingLimit, on the other hand, is reactive and dynamically calculated by the RM, then passed down to the NM in a heartbeat response. The RM continuously computes the mean/median of the queue lengths across all nodes and tells the NM to shed containers from its queue if it is too long. I agree that maxOppQueueLength can probably be removed, but given your observation in YARN-6706 that test cases depend on it, my opinion is that we keep it, set a very high value by default, and mark it as @VisibleForTesting only.
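To make the distinction concrete, here is a minimal sketch of how a static admission cap differs from a dynamically pushed queuing limit. The field and method names are hypothetical, not the actual ContainerScheduler code:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

class QueueLimitsSketch {
  // Static cap, checked at admission time: the scheduler never admits more.
  private final int maxOppQueueLength;
  // Dynamic target, pushed by the RM in a heartbeat response.
  private volatile int queuingLimit = Integer.MAX_VALUE;

  private final Deque<String> queuedOppContainers = new ArrayDeque<>();

  QueueLimitsSketch(int maxOppQueueLength) {
    this.maxOppQueueLength = maxOppQueueLength;
  }

  // Admission path: enforce the active limit.
  synchronized boolean tryQueue(String containerId) {
    if (queuedOppContainers.size() >= maxOppQueueLength) {
      return false; // reject outright
    }
    queuedOppContainers.add(containerId);
    return true;
  }

  // Heartbeat path: shed queued containers down to the RM-computed limit.
  synchronized void onQueuingLimit(int newLimit) {
    queuingLimit = newLimit;
    while (queuedOppContainers.size() > queuingLimit) {
      // The real NM would kill/de-queue the shed container here.
      queuedOppContainers.removeLast();
    }
  }
}
{code}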
Is there value in splitting runningContainers into runningGuaranteed and runningOpportunistic?
Hmm… I was actually thinking of removing runningContainers itself. It was introduced to keep track of all running containers (containers whose state is RUNNING) AND those that have been scheduled but are not yet running. I think it may be better to encapsulate the latter as a proper container state, something like SCHEDULED_TO_RUN, reached via a proper state transition.
Adding more data structures might be problematic later on, since we can hit subtle race conditions when transferring containers from runningOpportunistic to runningGuaranteed (during promotion) and vice versa (during demotion) if we are not careful about synchronization. Also, given that an NM will not run more than, say, a couple hundred containers, it might be better to just iterate over all the containers when the scheduler needs to make a decision (see the sketch below).
Another problem with keeping a separate map is that during NM recovery we would have to populate it explicitly. We don't do that for running containers now either, but if we removed the runningContainers map, we wouldn't have to: we already have a QUEUED state in the NMStateStore which can be used to set the correct state on the recovered container.
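For illustration, a rough sketch of the "just iterate" approach. ExecutionType mirrors the real YARN enum; everything else here is simplified and hypothetical:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SingleMapSketch {
  enum ExecutionType { GUARANTEED, OPPORTUNISTIC }
  enum ContainerState { SCHEDULED_TO_RUN, RUNNING, QUEUED, DONE }

  static final class Container {
    final ExecutionType execType;
    volatile ContainerState state;
    Container(ExecutionType execType, ContainerState state) {
      this.execType = execType;
      this.state = state;
    }
  }

  // One map for all containers; no separate guaranteed/opportunistic maps
  // to keep in sync during promotion/demotion, or to rebuild on recovery.
  private final Map<String, Container> containers = new ConcurrentHashMap<>();

  // With at most a few hundred containers per NM, a linear scan per
  // scheduling decision is cheap.
  long runningOpportunistic() {
    return containers.values().stream()
        .filter(c -> c.execType == ExecutionType.OPPORTUNISTIC)
        .filter(c -> c.state == ContainerState.RUNNING
            || c.state == ContainerState.SCHEDULED_TO_RUN)
        .count();
  }
}
{code}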
getOpportunisticContainersStatus method implementation feels awkward…
Kind of agree with you there; I don't recall exactly why we did it like that. I think it was to avoid having to create a new instance of the status at every heartbeat.
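Presumably the intent was something like the following pattern, sketched here with made-up fields rather than the real OpportunisticContainersStatus API:
{code:java}
class StatusCacheSketch {
  static final class OpportunisticContainersStatus {
    int queuedCount;
    int runningCount;
  }

  // One mutable instance, updated in place on every heartbeat instead of
  // allocating a fresh status object each time.
  private final OpportunisticContainersStatus cachedStatus =
      new OpportunisticContainersStatus();

  OpportunisticContainersStatus getOpportunisticContainersStatus(
      int queued, int running) {
    cachedStatus.queuedCount = queued;
    cachedStatus.runningCount = running;
    return cachedStatus; // reused instance: the "awkward" part
  }
}
{code}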
Have we considered folding the ContainerQueuingLimit class into this?
My first instinct is to keep it separate. I don't think we should mix the queuing aspect of the ContainerScheduler with the ExecutionType aspect. Also, one is part of the NM heartbeat request and the other comes back in the response.
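A simplified sketch of that separation; the class shapes here are illustrative, not the actual protocol records:
{code:java}
class HeartbeatDirectionSketch {
  // NM -> RM: the ExecutionType-related status travels on the request.
  static final class OpportunisticContainersStatus { int queuedCount; }
  static final class NodeHeartbeatRequest {
    OpportunisticContainersStatus oppContainersStatus;
  }

  // RM -> NM: the queuing limit travels back on the response.
  static final class ContainerQueuingLimit { int maxQueueLength; }
  static final class NodeHeartbeatResponse {
    ContainerQueuingLimit containerQueuingLimit;
  }
}
{code}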