Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-6831

Miscellaneous refactoring changes of ContainScheduler

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels:
      None
    • Target Version/s:

      Description

      While reviewing YARN-6706, Karthik pointed out a few issues for improvment in ContainerScheduler

      *Make ResourceUtilizationTracker pluggable. That way, we could use a different tracker when oversubscription is enabled.

      *ContainerScheduler
      ##Why do we need maxOppQueueLength given queuingLimit?
      ##Is there value in splitting runningContainers into runningGuaranteed and runningOpportunistic?
      ##getOpportunisticContainersStatus method implementation feels awkward. How about capturing the state in the field here, and have metrics etc. pull from here?
      ##startContainersFromQueue: Local variable resourcesAvailable is unnecessary

      *OpportunisticContainersStatus
      ##Let us clearly differentiate between allocated, used and utilized. Maybe, we should rename current Used methods to Allocated?
      ##I prefer either full name Opportunistic (in method) or Opp (shortest name that makes sense). Opport is neither short nor fully descriptive.
      ##Have we considered folding ContainerQueuingLimit class into this?

      We decided to move the issues into this follow up jira to keep YARN-6706 moving forward to unblock oversubscription work.

        Activity

        Hide
        asuresh Arun Suresh added a comment - - edited

        Thanks for raising this Haibo Chen / Karthik Kambatla. Some thoughts:

        Why do we need maxOppQueueLength given queuingLimit?

        So, maxOppQueueLength is more like an active limit. The CS (ContainerScheduler) will not admit any more containers than that value. While the queuingLimit is more reactive and dynamically calculated by the RM and passed down to the NM in a HB response. The RM constantly calculates the mean/median of the queueLengths on all nodes and it tells the NM to shed containers from the queue if it is too high. I agree that the maxOppQueueLength can probably be removed though. But given your observation in YARN-6706 that test cases depends on this, my opinion is that we will keep it, and put a very high value by default - and mark it as VisibileForTesting only.

        Is there value in splitting runningContainers into runningGuaranteed and runningOpportunistic ?

        Hmm… I was actually thinking of removing the runningContainers itself. It was introduced to keep track of all running containers (containers whose state is running) AND those that have been scheduled but not yet running. I think it may be better to encapsulate that as a proper container state, something like SCHEDULED_TO_RUN via a proper transition.
        Adding more data structures might be problematic later on, since we can hit minor race conditions when transferring containers from runningGuaranteed to running Opportunistic (during promotion) and vice-versa (during demotion) if we are not careful about synchronization etc. Also, given the fact that a NM will not run more than say a couple of 100 containers, it might be better to just iterate over all the containers when the scheduler needs to make a decision.
        Another problem with keeping a separate map is during NM recovery, we have to populate this specifically. we don’t do that for running containers now either – but I was thinking if we removed the runningContainers map, we wont have to (we already have a state called QUEUED in the NMStateStore which can be used to set the correct state in the recovered container)

        getOpportunisticContainersStatus method implementation feels awkward..

        Kind of agree with you there, don’t recall exactly why we did it like that… think it was to not have to create a new instance of the status at every heart beat.

        Have we considered folding ContainerQueuingLimit class into this

        My first instinct is to keep it separate. Don’t think we should mix the Queuing aspect of the Container Scheduler with the ExecutionType aspect. Also, one is part of the NM heartbeat request and the other comes back as response.

        Show
        asuresh Arun Suresh added a comment - - edited Thanks for raising this Haibo Chen / Karthik Kambatla . Some thoughts: Why do we need maxOppQueueLength given queuingLimit? So, maxOppQueueLength is more like an active limit. The CS (ContainerScheduler) will not admit any more containers than that value. While the queuingLimit is more reactive and dynamically calculated by the RM and passed down to the NM in a HB response. The RM constantly calculates the mean/median of the queueLengths on all nodes and it tells the NM to shed containers from the queue if it is too high. I agree that the maxOppQueueLength can probably be removed though. But given your observation in YARN-6706 that test cases depends on this, my opinion is that we will keep it, and put a very high value by default - and mark it as VisibileForTesting only. Is there value in splitting runningContainers into runningGuaranteed and runningOpportunistic ? Hmm… I was actually thinking of removing the runningContainers itself. It was introduced to keep track of all running containers (containers whose state is running) AND those that have been scheduled but not yet running. I think it may be better to encapsulate that as a proper container state, something like SCHEDULED_TO_RUN via a proper transition. Adding more data structures might be problematic later on, since we can hit minor race conditions when transferring containers from runningGuaranteed to running Opportunistic (during promotion) and vice-versa (during demotion) if we are not careful about synchronization etc. Also, given the fact that a NM will not run more than say a couple of 100 containers, it might be better to just iterate over all the containers when the scheduler needs to make a decision. Another problem with keeping a separate map is during NM recovery, we have to populate this specifically. we don’t do that for running containers now either – but I was thinking if we removed the runningContainers map, we wont have to (we already have a state called QUEUED in the NMStateStore which can be used to set the correct state in the recovered container) getOpportunisticContainersStatus method implementation feels awkward.. Kind of agree with you there, don’t recall exactly why we did it like that… think it was to not have to create a new instance of the status at every heart beat. Have we considered folding ContainerQueuingLimit class into this My first instinct is to keep it separate. Don’t think we should mix the Queuing aspect of the Container Scheduler with the ExecutionType aspect. Also, one is part of the NM heartbeat request and the other comes back as response.
        Hide
        asuresh Arun Suresh added a comment -

        Do take a look at YARN-6835 where i've posted an initial patch removing runningContainers

        Show
        asuresh Arun Suresh added a comment - Do take a look at YARN-6835 where i've posted an initial patch removing runningContainers
        Hide
        asuresh Arun Suresh added a comment -

        I was thinking about removing maxOppQueueLength which led me to think about the following.
        In YARN-5972, we are trying to get the NM to pause an opportunistic container instead of killing it. Both cgroup freezer and windows job objects implement freezing in the following way:
        When a process is frozen, it's cpu share is reduced to 0 and its working set remains in memory as long as there is no external memory pressure. If the OS can't keep the frozen process in memory, it's memory is swapped out to disk and restored when the process is thawed. This implies that the number of paused containers is limited to the total swap space on the NM. This should be another local NM config, maybe something like maxConsumedOpportunisticResources which places an additional limit on number of running opportunistic containers.

        Show
        asuresh Arun Suresh added a comment - I was thinking about removing maxOppQueueLength which led me to think about the following. In YARN-5972 , we are trying to get the NM to pause an opportunistic container instead of killing it. Both cgroup freezer and windows job objects implement freezing in the following way: When a process is frozen, it's cpu share is reduced to 0 and its working set remains in memory as long as there is no external memory pressure. If the OS can't keep the frozen process in memory, it's memory is swapped out to disk and restored when the process is thawed. This implies that the number of paused containers is limited to the total swap space on the NM. This should be another local NM config, maybe something like maxConsumedOpportunisticResources which places an additional limit on number of running opportunistic containers.

          People

          • Assignee:
            haibochen Haibo Chen
            Reporter:
            haibochen Haibo Chen
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development