Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
After merging YUNIKORN-2539, we spotted a potential issue involving rmproxy.RMProxy and cache.Context:
Goroutine 1:
github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback <<<<<
github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback
github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 cache.(*Context).updateNodeResources
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 cache.(*Context).updateNodeOccupiedResources
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 cache.(*Context).updateForeignPod
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 cache.(*Context).UpdatePod
Goroutine 2:
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 cache.(*Context).ForgetPod <<<<<
github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 cache.(*Context).ForgetPod
github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 cache.(*AsyncRMCallback).UpdateAllocation
github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:162 rmproxy.(*RMProxy).triggerUpdateAllocation
github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:150 rmproxy.(*RMProxy).processRMReleaseAllocationEvent
github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:234 rmproxy.(*RMProxy).handleRMEvents
Right now this seems to be safe, because the RMProxy methods only call RLock() and read locks do not exclude each other. However, the two paths acquire the locks in opposite order (Cache->RMProxy in goroutine 1, RMProxy->Cache in goroutine 2), so if any of these call sites is ever changed to take a write lock, we risk an ABBA deadlock.
We need to investigate why only RLock() is used and whether the lock is needed at all. If nothing is modified under it, we can drop the mutex completely.