Ambari / AMBARI-17106

Deadlock While Updating Stale Configuration Cache During Upgrade

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: ambari-server
    • Labels: None

    Description

      ambari-server --hash
      dc340e8c6cb4fa6c062f805cc1917f62299a5f50
      ambari-server-2.4.0.0-622.x86_64

      Steps

      1. Deploy HDP-2.4.0.0 cluster with Ambari 2.4.0.0 (unsecure, non-HA cluster, SSL enabled)
      2. Start an Express Upgrade (EU) to HDP-2.5.0.0-609

      Result
      While the EU was in progress, the Ambari server appeared to hang: the login page loads, but logging in is not possible. The following API call hangs as well: https://server:8443/api/v1/clusters/cl1/

      There is a deadlock when trying to update the stale configuration cache:

      "Server Action Executor Worker 401" #225 prio=5 os_prio=0 tid=0x00007fa07c03e800 nid=0x65df waiting on condition [0x00007fa0737ef000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000000a059d4f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
      	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
      
        --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.WRITE)
      
        at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.lockTransaction(AmbariJpaLocalTxnInterceptor.java:291)
      	at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:114)
      	at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
      	at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
      	at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$991b84fc.applyConfigs(<generated>)
      
        --> clusterGlobalLock.writeLock().lock();
      
        at org.apache.ambari.server.state.cluster.ClusterImpl.addDesiredConfig(ClusterImpl.java:2340)
      	at org.apache.ambari.server.state.ConfigHelper.createConfigTypes(ConfigHelper.java:897)
      	at org.apache.ambari.server.controller.internal.UpgradeResourceProvider.applyStackAndProcessConfigurations(UpgradeResourceProvider.java:1174)
      
      "ambari-hearbeat-monitor" #23 prio=5 os_prio=0 tid=0x00007fa07476c000 nid=0x20ad waiting on condition [0x00007fa07bbfb000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000000a32d44a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
      	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
      	at org.apache.ambari.server.state.cluster.ClusterImpl.getDesiredStackVersion(ClusterImpl.java:1052)
      	at org.apache.ambari.server.state.ConfigHelper.calculateIsStaleConfigs(ConfigHelper.java:1075)
      
        --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.READ)
        
      	at org.apache.ambari.server.state.ConfigHelper.isStaleConfigs(ConfigHelper.java:456)
      	at org.apache.ambari.server.agent.HeartbeatMonitor.createStatusCommand(HeartbeatMonitor.java:311)
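
The two traces above show a lock-order inversion: the upgrade worker holds the cluster global write lock and waits for the stale-config-cache write lock, while the heartbeat monitor holds the stale-config-cache read lock and waits for the cluster global read lock. A minimal sketch of that pattern follows. The class, method, and variable names are hypothetical stand-ins, not Ambari code, and `tryLock` with a timeout is used only so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockOrderInversionDemo {

    // Takes `first`, waits until the other thread also holds its first lock,
    // then tries (with a timeout) to take `second`. Records whether it succeeded.
    static Thread worker(String name, Lock first, Lock second,
                         CountDownLatch firstLocksHeld, CountDownLatch attemptsFinished,
                         boolean[] gotSecondLock, int slot) {
        return new Thread(() -> {
            first.lock();
            try {
                firstLocksHeld.countDown();
                firstLocksHeld.await();
                gotSecondLock[slot] = second.tryLock(200, TimeUnit.MILLISECONDS);
                if (gotSecondLock[slot]) {
                    second.unlock();
                }
                // Keep holding `first` until both attempts are over, as a real
                // deadlocked thread would.
                attemptsFinished.countDown();
                attemptsFinished.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.unlock();
            }
        }, name);
    }

    // Returns true when neither thread could take its second lock, i.e. when
    // plain lock() calls would have deadlocked.
    static boolean bothThreadsBlocked() {
        // Illustrative stand-ins for the two locks seen in the thread dump.
        ReentrantReadWriteLock clusterGlobalLock = new ReentrantReadWriteLock();
        ReentrantReadWriteLock staleConfigCacheLock = new ReentrantReadWriteLock();
        CountDownLatch firstLocksHeld = new CountDownLatch(2);
        CountDownLatch attemptsFinished = new CountDownLatch(2);
        boolean[] gotSecondLock = new boolean[2];

        // Upgrade worker: cluster global WRITE, then stale cache WRITE.
        Thread upgradeWorker = worker("upgrade-worker",
                clusterGlobalLock.writeLock(), staleConfigCacheLock.writeLock(),
                firstLocksHeld, attemptsFinished, gotSecondLock, 0);
        // Heartbeat monitor: stale cache READ, then cluster global READ.
        Thread heartbeatMonitor = worker("heartbeat-monitor",
                staleConfigCacheLock.readLock(), clusterGlobalLock.readLock(),
                firstLocksHeld, attemptsFinished, gotSecondLock, 1);

        upgradeWorker.start();
        heartbeatMonitor.start();
        try {
            upgradeWorker.join();
            heartbeatMonitor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecondLock[0] && !gotSecondLock[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + bothThreadsBlocked());
    }
}
```

With blocking `lock()` calls in place of `tryLock`, neither thread would ever proceed, which matches the two WAITING (parking) states in the dump.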
      

      This is another case of an Ambari cache competing with a JPA transaction. Consider these steps:

      • A new configuration is created within the context of a Transaction
      • Within that same Transaction, the stale configuration cache is told to invalidate
      • After purging the old data, but before the Transaction is committed, another thread tries to read from the cache. Because the new configuration is not yet visible to that thread, it ends up re-populating the cache with the old data.
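
The interleaving in the steps above can be replayed deterministically in a single thread. This is only a simulation under assumed names: `committed` stands in for the state other threads can read from the database, `cache` for the stale configuration cache, and the `hdfs-site`/`v1`/`v2` values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class StaleCacheRaceDemo {

    // Replays the problematic ordering and returns what the cache holds afterwards.
    static String simulate() {
        Map<String, String> committed = new HashMap<>(); // visible to all threads
        Map<String, String> cache = new HashMap<>();     // hypothetical stale-config cache
        committed.put("hdfs-site", "v1");
        cache.put("hdfs-site", "v1");

        // Step 1: a transaction creates a new configuration; it is not yet committed.
        String pendingInTransaction = "v2";

        // Step 2: within the same transaction, the cache is invalidated.
        cache.clear();

        // Step 3: another thread reads before the commit. It misses the cache and
        // reloads from the committed state, which is still the old value.
        cache.computeIfAbsent("hdfs-site", committed::get);

        // Step 4: the transaction finally commits.
        committed.put("hdfs-site", pendingInTransaction);

        // The cache still holds the old value even though "v2" is now committed.
        return cache.get("hdfs-site");
    }

    public static void main(String[] args) {
        System.out.println("cached value after commit: " + simulate());
    }
}
```

Here `simulate()` returns the old value `v1` although `v2` has been committed, which is the stale state described above.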

      Sometimes the code works because the Transaction is able to commit before another thread re-populates the cache. In theory, we should lock around cache reads to ensure that no transaction is writing to it. However, that locking is what caused the deadlock, since it interferes with our wonderful "cluster global lock of doom".

      Instead, it's safer in this case to just invalidate the cache after the Transaction completes.

      • We do this invalidation on a separate thread to ensure we don't have issues with the cluster global lock
      • Since the cache isn't needed within the context of the invalidation call, it's OK to purge it asynchronously.
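
One way to picture this approach is the sketch below. It is not the AMBARI-17106 patch: the class, the two maps, and the single-thread executor are assumptions made for illustration. The point it shows is the ordering, where the commit happens first and the cache purge is then handed to another thread, so the committing thread never waits on a cache lock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class PostCommitInvalidationDemo {
    // Hypothetical stand-ins for committed database state and the stale-config cache.
    static final Map<String, String> committed = new ConcurrentHashMap<>();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Dedicated daemon thread for purging, so the committing thread never blocks on it.
    static final ExecutorService invalidator = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "stale-config-invalidator");
        t.setDaemon(true);
        return t;
    });

    // Cache read: on a miss, reload from committed state.
    static String read(String type) {
        return cache.computeIfAbsent(type, committed::get);
    }

    // "Commit" the new value first, then schedule the purge asynchronously.
    static Future<?> updateConfig(String type, String newTag) {
        committed.put(type, newTag);            // the transaction completes here
        return invalidator.submit(() -> {       // purge only after the commit
            cache.remove(type);
        });
    }

    // Warm the cache, update, wait for the purge, and read again.
    static String demo() {
        committed.put("hdfs-site", "v1");
        read("hdfs-site");                      // cache now holds the old value
        Future<?> purge = updateConfig("hdfs-site", "v2");
        try {
            purge.get(1, TimeUnit.SECONDS);     // waiting here is for the demo only
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return read("hdfs-site");
    }

    public static void main(String[] args) {
        System.out.println("value read after purge: " + demo());
    }
}
```

Between the commit and the purge a reader can still see the old cached value for a short time. Any reader that re-populates the cache after the purge loads the committed value, so the cache cannot be left holding stale data.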

      Attachments

        1. AMBARI-17106.patch
          14 kB
          Jonathan Hurley

            People

              jonathanhurley Jonathan Hurley
              jonathanhurley Jonathan Hurley