[AMBARI-17106] Deadlock While Updating Stale Configuration Cache During Upgrade - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.0
Component/s: ambari-server
Labels: None

Description

ambari-server --hash
dc340e8c6cb4fa6c062f805cc1917f62299a5f50
ambari-server-2.4.0.0-622.x86_64

Steps

Deploy HDP-2.4.0.0 cluster with Ambari 2.4.0.0 (unsecure, non-HA cluster, SSL enabled)
Start EU to HDP-2.5.0.0-609

Result
While the EU was in progress, the Ambari server appeared to hang: the login page loaded, but logging in was not possible. The following API call also hung: https://server:8443/api/v1/clusters/cl1/

There is a deadlock when trying to update the stale configuration cache:

"Server Action Executor Worker 401" #225 prio=5 os_prio=0 tid=0x00007fa07c03e800 nid=0x65df waiting on condition [0x00007fa0737ef000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000a059d4f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)

  --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.WRITE)

  at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.lockTransaction(AmbariJpaLocalTxnInterceptor.java:291)
	at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:114)
	at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
	at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
	at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$991b84fc.applyConfigs(<generated>)

  --> clusterGlobalLock.writeLock().lock();

  at org.apache.ambari.server.state.cluster.ClusterImpl.addDesiredConfig(ClusterImpl.java:2340)
	at org.apache.ambari.server.state.ConfigHelper.createConfigTypes(ConfigHelper.java:897)
	at org.apache.ambari.server.controller.internal.UpgradeResourceProvider.applyStackAndProcessConfigurations(UpgradeResourceProvider.java:1174)

"ambari-hearbeat-monitor" #23 prio=5 os_prio=0 tid=0x00007fa07476c000 nid=0x20ad waiting on condition [0x00007fa07bbfb000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000a32d44a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
	at org.apache.ambari.server.state.cluster.ClusterImpl.getDesiredStackVersion(ClusterImpl.java:1052)
	at org.apache.ambari.server.state.ConfigHelper.calculateIsStaleConfigs(ConfigHelper.java:1075)

  --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.READ)
  
	at org.apache.ambari.server.state.ConfigHelper.isStaleConfigs(ConfigHelper.java:456)
	at org.apache.ambari.server.agent.HeartbeatMonitor.createStatusCommand(HeartbeatMonitor.java:311)
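
The two traces show opposite lock orders: the upgrade thread holds the cluster global write lock and waits for the stale-config-cache write lock, while the heartbeat thread holds the stale-config-cache read lock and waits for the cluster global read lock. A minimal sketch of that inversion follows. It is not Ambari code: the class name, the latches, and the use of a timed tryLock (so the demo ends instead of hanging) are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderInversion {
    /** Returns {upgradeGotCacheLock, heartbeatGotGlobalLock}. */
    static boolean[] run() throws InterruptedException {
        // Stand-ins for the two locks named in the traces (hypothetical instances).
        ReentrantReadWriteLock clusterGlobalLock = new ReentrantReadWriteLock();
        ReentrantReadWriteLock staleConfigCacheLock = new ReentrantReadWriteLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Mirrors "Server Action Executor Worker": global write lock, then cache write lock.
        Thread upgrade = new Thread(() -> {
            clusterGlobalLock.writeLock().lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await(); // wait until the other thread holds its first lock
                acquired[0] = staleConfigCacheLock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
                if (acquired[0]) staleConfigCacheLock.writeLock().unlock();
                bothTried.countDown();
                bothTried.await(); // keep the first lock until both attempts are over
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                clusterGlobalLock.writeLock().unlock();
            }
        });

        // Mirrors the heartbeat monitor: cache read lock, then global read lock.
        Thread heartbeat = new Thread(() -> {
            staleConfigCacheLock.readLock().lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                acquired[1] = clusterGlobalLock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
                if (acquired[1]) clusterGlobalLock.readLock().unlock();
                bothTried.countDown();
                bothTried.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                staleConfigCacheLock.readLock().unlock();
            }
        });

        upgrade.start();
        heartbeat.start();
        upgrade.join();
        heartbeat.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("upgrade got cache lock: " + r[0]
            + ", heartbeat got global lock: " + r[1]);
    }
}
```

Both timed attempts fail because each thread still holds the lock the other one needs. With a plain lock() call, as in the traces, neither thread would ever proceed.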

This is another case of an Ambari cache competing with a JPA transaction. Consider these steps:

A new configuration is created within the context of a Transaction.
Within that same Transaction, the stale configuration cache is told to invalidate.
After the old data is purged, but before the Transaction is committed, another thread tries to read from the cache. It ends up re-populating the cache with the old data.
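
The interleaving above can be replayed deterministically in a single thread. This is a sketch under stated assumptions, not Ambari's implementation: the two maps stand in for committed database state and the stale configuration cache, and the key "hdfs-site" and values "v1"/"v2" are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class StaleCacheRace {
    // Stand-in for committed database state (hypothetical, not Ambari's schema).
    static final Map<String, String> committed = new HashMap<>();
    // Stand-in for the stale configuration cache.
    static final Map<String, String> cache = new HashMap<>();

    // On a cache miss, a reader populates the cache from committed state.
    static String read(String key) {
        return cache.computeIfAbsent(key, committed::get);
    }

    static String replay() {
        committed.clear();
        cache.clear();
        committed.put("hdfs-site", "v1");
        read("hdfs-site");                   // cache is warm with v1

        String pending = "v2";               // 1. new config created inside an open transaction
        cache.clear();                       // 2. cache invalidated inside the same transaction
        read("hdfs-site");                   // 3. another reader runs before commit and caches v1 again
        committed.put("hdfs-site", pending); // 4. the transaction commits

        return read("hdfs-site");            // the cache still serves the old value
    }

    public static void main(String[] args) {
        System.out.println("cached after commit: " + replay()); // prints "cached after commit: v1"
    }
}
```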

Sometimes the code works because the Transaction is able to commit before the cache is re-populated by another thread. In theory, we should be locking around reading the cache to ensure that there isn't a transaction writing to it. However, this is what caused the deadlock since it interferes with our wonderful "cluster global lock of doom".

Instead, it's safer in this case to just invalidate the cache after the Transaction completes.

We perform this invalidation on a separate thread to ensure we don't have issues with the cluster global lock.
Since the cache isn't needed within the context of the invalidation call, it's OK to purge it asynchronously.
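
A rough sketch of this approach is shown below. It is not the attached patch: the Txn class, its afterCommit list, and the single-thread executor are hypothetical stand-ins chosen to show the ordering (commit first, then purge the cache on another thread).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PostCommitInvalidation {
    static final Map<String, String> committed = new ConcurrentHashMap<>();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Hypothetical transaction that buffers writes and post-commit callbacks. */
    static class Txn {
        final Map<String, String> pending = new ConcurrentHashMap<>();
        final List<Runnable> afterCommit = new ArrayList<>();

        void commit(ExecutorService executor) {
            committed.putAll(pending);  // the data becomes visible first
            for (Runnable callback : afterCommit) {
                executor.submit(callback); // then callbacks run on another thread
            }
        }
    }

    // On a cache miss, a reader populates the cache from committed state.
    static String read(String key) {
        return cache.computeIfAbsent(key, committed::get);
    }

    static String run() throws InterruptedException {
        committed.clear();
        cache.clear();
        committed.put("hdfs-site", "v1");
        ExecutorService executor = Executors.newSingleThreadExecutor();

        Txn txn = new Txn();
        txn.pending.put("hdfs-site", "v2");
        txn.afterCommit.add(cache::clear); // registered now, executed only after commit
        read("hdfs-site");                 // a read before commit caches v1, which is harmless here
        txn.commit(executor);

        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS); // wait for the asynchronous purge
        return read("hdfs-site");          // repopulated from committed state
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cached after commit: " + run()); // prints "cached after commit: v2"
    }
}
```

Because the purge runs only after the commit, a reader that repopulates the cache afterwards sees the new data, and the invalidating thread never takes a cache lock while it holds the cluster global lock.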

Attachments

AMBARI-17106.patch
08/Jun/16 15:04
14 kB
Jonathan Hurley

Issue Links

links to

Reviewboard
