Details
-
Bug
-
Status: Resolved
-
Blocker
-
Resolution: Fixed
-
2.4.0
-
None
Description
ambari-server --hash
dc340e8c6cb4fa6c062f805cc1917f62299a5f50
ambari-server-2.4.0.0-622.x86_64
Steps
- Deploy HDP-2.4.0.0 cluster with Ambari 2.4.0.0 (unsecure, non-HA cluster, SSL enabled)
- Start an Express Upgrade (EU) to HDP-2.5.0.0-609
Result
While the EU was in progress, the Ambari server appeared to hang: the login page loads, but logging in is not possible. The following API call hangs too: https://server:8443/api/v1/clusters/cl1/
There is a deadlock when trying to update the stale configuration cache:

"Server Action Executor Worker 401" #225 prio=5 os_prio=0 tid=0x00007fa07c03e800 nid=0x65df waiting on condition [0x00007fa0737ef000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000a059d4f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
    --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.WRITE)
    at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.lockTransaction(AmbariJpaLocalTxnInterceptor.java:291)
    at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:114)
    at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
    at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
    at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$991b84fc.applyConfigs(<generated>)
    --> clusterGlobalLock.writeLock().lock();
    at org.apache.ambari.server.state.cluster.ClusterImpl.addDesiredConfig(ClusterImpl.java:2340)
    at org.apache.ambari.server.state.ConfigHelper.createConfigTypes(ConfigHelper.java:897)
    at org.apache.ambari.server.controller.internal.UpgradeResourceProvider.applyStackAndProcessConfigurations(UpgradeResourceProvider.java:1174)

"ambari-hearbeat-monitor" #23 prio=5 os_prio=0 tid=0x00007fa07476c000 nid=0x20ad waiting on condition [0x00007fa07bbfb000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000a32d44a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.apache.ambari.server.state.cluster.ClusterImpl.getDesiredStackVersion(ClusterImpl.java:1052)
    at org.apache.ambari.server.state.ConfigHelper.calculateIsStaleConfigs(ConfigHelper.java:1075)
    --> @TransactionalLock(lockArea = LockArea.STALE_CONFIG_CACHE, lockType = LockType.READ)
    at org.apache.ambari.server.state.ConfigHelper.isStaleConfigs(ConfigHelper.java:456)
    at org.apache.ambari.server.agent.HeartbeatMonitor.createStatusCommand(HeartbeatMonitor.java:311)
This is another case of an Ambari cache competing with a JPA transaction. Consider these steps:
- A new configuration is created within the context of a Transaction
- Within that same Transaction, the stale configuration cache is told to invalidate
- After purging the old data, but before the Transaction is committed, another thread tries to read from the cache. It ends up re-populating the old data.
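The interleaving in these steps can be replayed deterministically in a single thread. This is only a minimal sketch: the two maps and the `read` helper are hypothetical stand-ins for the committed database state and the stale configuration cache, not Ambari classes, and the "transaction" is simulated by delaying the write to the `committed` map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaleCacheRaceSketch {
    // Hypothetical stand-in for state that is visible once a transaction has committed.
    static final Map<String, String> committed = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the stale configuration cache.
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    /** What a reader does: serve from the cache, or reload committed state on a miss. */
    static String read(String key) {
        return cache.computeIfAbsent(key, committed::get);
    }

    /** Replays the problematic ordering and returns what readers see after the commit. */
    static String replayInterleaving() {
        committed.clear();
        cache.clear();
        committed.put("hdfs-site", "v1");
        read("hdfs-site");                    // cache is warm with v1

        String pending = "v2";                // step 1: new config exists only inside the transaction
        cache.remove("hdfs-site");            // step 2: cache invalidated inside the same transaction
        read("hdfs-site");                    // step 3: another thread misses and reloads v1
        committed.put("hdfs-site", pending);  // the transaction finally commits

        return read("hdfs-site");             // the cache still answers with v1
    }

    public static void main(String[] args) {
        System.out.println("cached after commit: " + replayInterleaving()); // v1
        System.out.println("committed value:     " + committed.get("hdfs-site")); // v2
    }
}
```

The cache and the committed state disagree until something invalidates the cache again.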
Sometimes the code works because the Transaction is able to commit before the cache is re-populated by another thread. In theory, we should be locking around reading the cache to ensure that there isn't a transaction writing to it. However, this is what caused the deadlock, since it interferes with our wonderful "cluster global lock of doom".
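The lock-ordering conflict visible in the thread dump can be sketched in isolation. The two lock fields below are hypothetical stand-ins for the cluster global lock and the STALE_CONFIG_CACHE lock; this is an illustration of the pattern, not the Ambari code. The sketch uses a timed `tryLock` so it terminates, whereas the real threads call `lock()` and park forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderInversionSketch {
    // Hypothetical stand-ins for the two locks in the thread dump.
    static final ReentrantReadWriteLock clusterGlobalLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock staleConfigCacheLock = new ReentrantReadWriteLock(true);

    /** Takes 'first', waits until both threads hold their first lock, then tries 'second'. */
    static Thread worker(Lock first, Lock second, CountDownLatch bothHoldFirst,
                         CountDownLatch bothTried, boolean[] acquired, int slot) {
        return new Thread(() -> {
            first.lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                // A plain lock() would park forever here; the timeout lets the demo finish.
                acquired[slot] = second.tryLock(200, TimeUnit.MILLISECONDS);
                if (acquired[slot]) {
                    second.unlock();
                }
                bothTried.countDown();
                bothTried.await(); // keep holding 'first' until the other attempt has ended
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.unlock();
            }
        });
    }

    /** Returns true when neither thread could take its second lock. */
    static boolean bothThreadsBlocked() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];
        // Upgrade path: cluster global write lock first, then the cache write lock.
        Thread upgrade = worker(clusterGlobalLock.writeLock(), staleConfigCacheLock.writeLock(),
                bothHoldFirst, bothTried, acquired, 0);
        // Heartbeat path: cache read lock first, then the cluster global read lock.
        Thread heartbeat = worker(staleConfigCacheLock.readLock(), clusterGlobalLock.readLock(),
                bothHoldFirst, bothTried, acquired, 1);
        upgrade.start();
        heartbeat.start();
        try {
            upgrade.join();
            heartbeat.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !acquired[0] && !acquired[1];
    }

    public static void main(String[] args) {
        System.out.println("both blocked: " + bothThreadsBlocked()); // both blocked: true
    }
}
```

Each thread holds the lock the other one needs, which is why adding a lock around cache reads cannot work while the cluster global lock is acquired in the opposite order.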
Instead, it's safer in this case to just invalidate the cache after the Transaction completes.
- We do this invalidation on a separate thread to ensure we don't have issues with the cluster global lock.
- Since the cache isn't needed within the context of the invalidation call, it's OK to purge it asynchronously.
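A minimal sketch of that ordering, assuming a single-threaded executor for the purge. The `updateConfig` method, the executor, and the maps are hypothetical names for illustration and do not show how the actual fix is wired into Ambari's transaction handling.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class InvalidateAfterCommitSketch {
    // Hypothetical stand-ins for committed database state and the stale configuration cache.
    static final Map<String, String> committed = new ConcurrentHashMap<>();
    static final Map<String, String> cache = new ConcurrentHashMap<>();

    // Separate daemon thread, so the purge never runs on a thread holding other locks.
    static final ExecutorService invalidator = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "cache-invalidator");
        t.setDaemon(true);
        return t;
    });

    static String read(String key) {
        return cache.computeIfAbsent(key, committed::get);
    }

    /** Commits first, then schedules the purge; the caller does not wait for it. */
    static Future<?> updateConfig(String key, String value) {
        committed.put(key, value);                              // stands in for the commit
        return invalidator.submit(() -> { cache.remove(key); }); // purge only after the commit
    }

    static String demo() {
        committed.clear();
        cache.clear();
        committed.put("hdfs-site", "v1");
        read("hdfs-site");                                      // cache is warm with v1
        Future<?> purge = updateConfig("hdfs-site", "v2");
        try {
            purge.get(1, TimeUnit.SECONDS);                     // wait only so the demo is deterministic
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        }
        return read("hdfs-site");                               // reloads the committed v2
    }

    public static void main(String[] args) {
        System.out.println("read after purge: " + demo()); // v2
    }
}
```

In this sketch a reader may still see the old entry in the short window between the commit and the purge, but any reload after the purge picks up committed data, so the cache can no longer be re-populated with pre-commit state.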