Details

    Description

      After merging YUNIKORN-2539, we already saw a potential issue with rmproxy.RMProxy and cache.Context:

      Gourutine 1:

      github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:307 rmproxy.(*RMProxy).GetResourceManagerCallback ??? <<<<<
      github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:306 rmproxy.(*RMProxy).GetResourceManagerCallback ???
      github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:359 rmproxy.(*RMProxy).UpdateNode ???
      github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 cache.(*Context).updateNodeResources ???
      github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 cache.(*Context).updateNodeOccupiedResources ???
      github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 cache.(*Context).updateForeignPod ???
      github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 cache.(*Context).UpdatePod ???
      

      Goroutine 2:

      github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 cache.(*Context).ForgetPod ??? <<<<<
      github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 cache.(*Context).ForgetPod ???
      github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 cache.(*AsyncRMCallback).UpdateAllocation ???
      github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:162 rmproxy.(*RMProxy).triggerUpdateAllocation ???
      github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:150 rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
      github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:234 rmproxy.(*RMProxy).handleRMEvents ???
      

      Right now this seems to be safe because we only call RLock() in the RMProxy methods. However, should any of this change, we're in trouble due to lock ordering (Cache->RMProxy and RMProxy->Cache).

      We need to investigate why we use only RLock() and whether it's needed at all. If nothing is modified, then we can drop the mutex completely.

      Attachments

        Issue Links

          Activity

            People

              pbacsko Peter Bacsko
              pbacsko Peter Bacsko
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: