Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Fix Version: 3.1.2
- Flags: Reviewed
Description
AbstractYarnScheduler.completedContainers can be called from multiple sources, and in some scenarios the caller does not hold the appropriate lock. This can cause the count of OpportunisticSchedulerMetrics.AllocatedOContainers to fall below 0.
To prevent double counting when releasing allocated O containers, a simple fix is to check whether the RMContainer has already been removed before decrementing the metric, though that may not address the underlying race condition.
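The check described above can be sketched as follows. This is a hypothetical, self-contained illustration (the class and method names are illustrative, not the actual YARN scheduler code): the set-removal result tells each caller whether it is the first to release a container, so duplicate or racing releases decrement the metric at most once.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class OContainerMetricsGuard {
    private final AtomicInteger allocatedOContainers = new AtomicInteger(0);
    // Track live container ids; remove() returning true means this call
    // is the first (and only) release of that container.
    private final Set<String> liveContainers = ConcurrentHashMap.newKeySet();

    public void allocate(String containerId) {
        if (liveContainers.add(containerId)) {
            allocatedOContainers.incrementAndGet();
        }
    }

    public void release(String containerId) {
        // Only the caller that actually removes the container decrements
        // the metric, so a double release cannot drive the count below 0.
        if (liveContainers.remove(containerId)) {
            allocatedOContainers.decrementAndGet();
        }
    }

    public int getAllocatedOContainers() {
        return allocatedOContainers.get();
    }

    public static void main(String[] args) {
        OContainerMetricsGuard m = new OContainerMetricsGuard();
        m.allocate("c1");
        m.release("c1");
        m.release("c1"); // duplicate release is ignored
        System.out.println(m.getAllocatedOContainers()); // prints 0, not -1
    }
}
```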
The following is a capture of OpportunisticSchedulerMetrics.AllocatedOContainers falling below 0, taken via a JMX query:
{
  "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
  "modelerType" : "OpportunisticSchedulerMetrics",
  "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
  "tag.Context" : "yarn",
  "tag.Hostname" : "",
  "AllocatedOContainers" : -2716,
  "AggregateOContainersAllocated" : 306020,
  "AggregateOContainersReleased" : 308736,
  "AggregateNodeLocalOContainersAllocated" : 0,
  "AggregateRackLocalOContainersAllocated" : 0,
  "AggregateOffSwitchOContainersAllocated" : 306020,
  "AllocateLatencyOQuantilesNumOps" : 0,
  "AllocateLatencyOQuantiles50thPercentileTime" : 0,
  "AllocateLatencyOQuantiles75thPercentileTime" : 0,
  "AllocateLatencyOQuantiles90thPercentileTime" : 0,
  "AllocateLatencyOQuantiles95thPercentileTime" : 0,
  "AllocateLatencyOQuantiles99thPercentileTime" : 0
}
UPDATE: Upon further investigation, the culprit appears to be that AllocatedOContainers is not incremented when the RM restarts: deallocation still decrements the recovered O containers, but they are never incremented on recovery. We have an initial fix for this and are waiting for verification.
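The recovery imbalance described in the update can be illustrated with a minimal sketch (all names here are hypothetical, not the actual RM recovery code): if recovered opportunistic containers are re-counted on restart, the later releases bring the gauge back to zero instead of driving it negative, which is consistent with the negative AllocatedOContainers and the Released > Allocated aggregates in the JMX capture.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RecoveryMetricsSketch {
    private final AtomicInteger allocatedOContainers = new AtomicInteger(0);

    // The fix: when the RM restarts and recovers a still-running
    // opportunistic container, re-increment the gauge. Skipping this
    // step is what lets the eventual release push the count below 0.
    public void recoverContainer(String containerId) {
        allocatedOContainers.incrementAndGet();
    }

    public void releaseContainer(String containerId) {
        allocatedOContainers.decrementAndGet();
    }

    public int getAllocatedOContainers() {
        return allocatedOContainers.get();
    }

    public static void main(String[] args) {
        RecoveryMetricsSketch metrics = new RecoveryMetricsSketch();
        // Simulate recovering two O containers after an RM restart...
        for (String id : List.of("c1", "c2")) {
            metrics.recoverContainer(id);
        }
        // ...then releasing them later. Without the recovery increment,
        // the gauge would end at -2 rather than 0.
        metrics.releaseContainer("c1");
        metrics.releaseContainer("c2");
        System.out.println(metrics.getAllocatedOContainers()); // prints 0
    }
}
```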