  1. Hadoop YARN
  2. YARN-11191

Global Scheduler refreshQueue causes deadlock

Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.9.0, 3.0.0, 3.1.0, 2.10.0, 3.2.0, 3.3.0
    • None
    • Component/s: capacity scheduler

    Description

      This is a potential bug that may affect every cluster with preemption enabled. In our current version, with preemption enabled, the CapacityScheduler calls the refreshQueue method of the PreemptionManager whenever it refreshes its queues. This call holds the PreemptionManager write lock and requires the csqueue read lock. Meanwhile, ParentQueue.canAssignToThisQueue holds the csqueue read lock and requires the PreemptionManager read lock.

      A deadlock is possible at this point because of one rule the read lock follows under the non-fair policy: when the lock is already held by a reader and the first request in the lock's wait queue is a write lock request, other read lock requests cannot acquire the lock.
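
The rule above can be observed with a plain java.util.concurrent.locks.ReentrantReadWriteLock, outside Hadoop. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the scheduler locks are default (non-fair) ReentrantReadWriteLock instances. It uses a timed tryLock so that the second reader gives up instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical demo class; it is not part of Hadoop. */
final class QueuedWriterBlocksReaders {

    /**
     * Returns whether a second reader thread obtains the read lock within
     * 200 ms while the calling thread already holds it as a reader.
     */
    static boolean secondReaderAcquires(boolean writerQueued) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair by default
        lock.readLock().lock(); // first reader: the calling thread
        Thread writer = null;
        try {
            if (writerQueued) {
                writer = new Thread(() -> {
                    lock.writeLock().lock(); // parks until every reader has released
                    lock.writeLock().unlock();
                });
                writer.start();
                // Wait until the writer is parked in the lock's wait queue.
                while (writer.getState() != Thread.State.WAITING) {
                    Thread.sleep(1);
                }
            }
            AtomicBoolean acquired = new AtomicBoolean();
            Thread reader = new Thread(() -> {
                try {
                    // The timed tryLock honors the queue policy; the untimed
                    // tryLock() would barge past the queued writer.
                    if (lock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                        acquired.set(true);
                        lock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            reader.start();
            reader.join();
            return acquired.get();
        } finally {
            lock.readLock().unlock(); // lets the queued writer proceed
            if (writer != null) {
                writer.join();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("no writer queued: " + secondReaderAcquires(false));
        System.out.println("writer queued:    " + secondReaderAcquires(true));
    }
}
```

With no writer waiting, the second reader shares the lock; once a writer is parked at the head of the queue, the second reader has to wait behind it even though the lock is only read-held.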

      So the potential deadlock is:

      CapacityScheduler.refreshQueue: hold: PreemptionManager.writeLock
                                      require: csqueue.readLock
      
      CapacityScheduler.schedule: hold: csqueue.readLock
                                  require: PreemptionManager.readLock
      
      other thread (completeContainer, release resource, etc.): require: csqueue.writeLock
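
The three-way wait listed above can be modelled with two stand-alone locks. This is a sketch under stated assumptions, not the scheduler code: preemptionLock and queueLock are hypothetical stand-ins for the PreemptionManager and csqueue locks, both assumed to be non-fair ReentrantReadWriteLock instances, and the thread names only mirror the roles in the listing. Timed tryLock calls replace the blocking lock() calls so the program terminates and reports the cycle instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-alone model of the cycle; it uses no Hadoop classes. */
final class RefreshQueueDeadlockModel {

    /** Returns {refresh got the queue read lock, schedule got the preemption read lock}. */
    static boolean[] run() throws InterruptedException {
        ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock(); // stand-in for the PreemptionManager lock
        ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();      // stand-in for the csqueue lock
        boolean[] got = new boolean[2];
        CountDownLatch firstLocksHeld = new CountDownLatch(2); // both "hold" locks are taken
        CountDownLatch writerQueued = new CountDownLatch(1);   // third thread is parked on queueLock
        CountDownLatch attemptsDone = new CountDownLatch(2);   // both "require" attempts have finished

        Thread refresh = new Thread(() -> {
            preemptionLock.writeLock().lock(); // hold: PreemptionManager write lock
            try {
                firstLocksHeld.countDown();
                writerQueued.await();
                // require: csqueue read lock (blocked behind the queued writer)
                got[0] = queueLock.readLock().tryLock(300, TimeUnit.MILLISECONDS);
                if (got[0]) queueLock.readLock().unlock();
                attemptsDone.countDown();
                attemptsDone.await(); // keep holding until the other attempt has finished too
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                preemptionLock.writeLock().unlock();
            }
        });

        Thread schedule = new Thread(() -> {
            queueLock.readLock().lock(); // hold: csqueue read lock
            try {
                firstLocksHeld.countDown();
                writerQueued.await();
                // require: PreemptionManager read lock (blocked by the held write lock)
                got[1] = preemptionLock.readLock().tryLock(300, TimeUnit.MILLISECONDS);
                if (got[1]) preemptionLock.readLock().unlock();
                attemptsDone.countDown();
                attemptsDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                queueLock.readLock().unlock();
            }
        });

        Thread other = new Thread(() -> {
            queueLock.writeLock().lock(); // require: csqueue write lock
            queueLock.writeLock().unlock();
        });

        refresh.start();
        schedule.start();
        firstLocksHeld.await();
        other.start();
        // Wait until the third thread is parked at the head of queueLock's wait queue.
        while (other.getState() != Thread.State.WAITING) {
            Thread.sleep(1);
        }
        writerQueued.countDown();
        refresh.join();
        schedule.join();
        other.join();
        return got;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] got = run();
        System.out.println("refresh got queue read lock:       " + got[0]);
        System.out.println("schedule got preemption read lock: " + got[1]);
    }
}
```

Neither attempt succeeds in this model. Without the third thread, the refresh thread could share the queue read lock with the schedule thread and the cycle would not close; with blocking lock() calls in place of the timed attempts, all three threads would wait forever.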
      
      

      The jstack logs captured at the time are attached (1.jstack), together with a screenshot of the lock holders (Lock holding status.png).

      Attachments

        1. Lock holding status.png
          44 kB
          ben yang
        2. 1.jstack
          623 kB
          ben yang
        3. YARN-11191.001.patch
          5 kB
          ben yang

          People

            tdomok Tamas Domok
            Kelo ben yang
            Votes: 1
            Watchers: 7
