Thanks for comments:
We probably need to address this properly in the JIRA that tracks container resource increase roll back. (I think Container resource increase expiration should be tracked as a Scheduler Event, e.g., SchedulerEventType.CONTAINER_INCREASE_EXPIRE)
I think we can do that either via adding CONTAINER_INCREASE_EXPIRE or directly call decrease of scheduler from RMContainer, I'm not sure which one is better, let's figure it out when doing it.
It seems that this function throws exception whenever there is a duplicated id. Shall we handle the case where if there are both increase and decrease requests for the same id, we can ignore the increase but keep the decrease request?
I have thought about this before, I think it's hard to decide when two request with same containerId but different target resource exists, which one will be chosen. And it's not like an expected allocate request as well. So I prefer to reject both.
Will it be better to combine all sanity checks into one function
For validateIncreaseDecreaseRequest, we don't check minimum allocation now, is it intended?
Yes it's intended, because we will normalize it later, so no need to throw exception.
This function is used by both pullNewlyIncreasedContainers(), and pullNewlyDecreasedContainers(). Why do we need to call updateContainerAndNMToken for decreased containers? It also unnecessarily send a ACQUIRE_UPDATED_CONTAINER event for every decreased container?
This is majorly makes logic correct and consistent, we don't use the container token for now, but I think we should make it updated before return to app. Unless we have any performance issue of doing this, I prefer to keep existing behavior.
We should probably check null before adding updatedContainer?
It seems no need to do the null check here. When it becomes null? I prefer to keep it as-is and it will throw NPE if any fatal issue happens.
AppSchedulingInfo#notifyContainerStopped not being used.
Removed, we handled this in LeafQueue#completedContainer.
I think the following is a typo, should be if (cannotAllocateAnything), right?
Not sure if I understand the logic. Why only break when node.getReservedContainer() == null? Shouldn't we break out of the loop here no matter what?
Nice catch! I fixed this, we should break when we allocated or reserved anything.
I think earlier in the allocateIncreaseRequest() function, if a new increase is successfully allocated, application.increaseContainer(increaseRequest) will have removed the increase request already?
Another nice catch! Yes we should already handled it in application.increaseContainer
RMContainerImpl...Shouldn't it be changed to...
Yes, it should do as you said, updated.
Also, is container.containerIncreased really needed?
It's needed when we don't know if a acquired event is for an increasedContainer or decreasedContainer, added isIncreaseContainer to acquire event (now it's RMContainerUpdatesAcquiredEvent). And removed RMContainerImpl.containerIncreased.