Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Fix Version: 2.4.0
- Labels: None
Description
Reproduced as part of creating a rolling upgrade on a large cluster:
- Cluster had 30 hosts with about 40 components per host (full stack)
- Begin a rolling upgrade; a lot of database activity is seen for about 1-2 minutes, and then the deadlock occurs.
Although it initially appeared to be a deadlock, it is actually caused by Postgres holding the socket open indefinitely: a write lock is held while the socket read is in progress. Jstack dumps taken many minutes apart show the same thread stuck in a socket read.
ambari-heartbeat-processor-0 acquires a write lock on ServiceComponentHostImpl in setState and is stuck reading:
"ambari-heartbeat-processor-0" #16 prio=5 os_prio=0 tid=0x00007fcc1080b000 nid=0x7e03 runnable [0x00007fcc163ce000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143) at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112) at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:71) at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:269) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1704) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255) - locked <0x00000000aad71388> (a org.postgresql.core.v3.QueryExecutorImpl) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:559) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:417) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:363) at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeDirectNoSelect(DatabaseAccessor.java:892) at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.executeNoSelect(DatabaseAccessor.java:964) at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.basicExecuteCall(DatabaseAccessor.java:633) at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatch(ParameterizedSQLBatchWritingMechanism.java:149) at org.eclipse.persistence.internal.databaseaccess.ParameterizedSQLBatchWritingMechanism.executeBatchedStatements(ParameterizedSQLBatchWritingMechanism.java:134) at 
org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.writesCompleted(DatabaseAccessor.java:1845) at org.eclipse.persistence.internal.sessions.AbstractSession.writesCompleted(AbstractSession.java:4300) at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.writesCompleted(UnitOfWorkImpl.java:5592) at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.acquireWriteLocks(UnitOfWorkImpl.java:1646) at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.commitTransactionAfterWriteChanges(UnitOfWorkImpl.java:1614) at org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.commitRootUnitOfWork(RepeatableWriteUnitOfWork.java:285) at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.commitAndResume(UnitOfWorkImpl.java:1169) at org.eclipse.persistence.internal.jpa.transaction.EntityTransactionImpl.commit(EntityTransactionImpl.java:134) at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:153) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl$$EnhancerByGuice$$e9f3a32.saveComponentStateEntityIfPersisted(<generated>) at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.setState(ServiceComponentHostImpl.java:850) at org.apache.ambari.server.agent.HeartbeatProcessor.processStatusReports(HeartbeatProcessor.java:610) at org.apache.ambari.server.agent.HeartbeatProcessor.processHeartbeat(HeartbeatProcessor.java:212) at org.apache.ambari.server.agent.HeartbeatProcessor$HeartbeatProcessingTask.run(HeartbeatProcessor.java:188)
Many other threads (for example ambari-client-thread-574) acquire a cluster read lock and are then left waiting on ambari-heartbeat-processor-0, since they need a read lock on the same ServiceComponentHostImpl:
"ambari-client-thread-574" #574 prio=5 os_prio=0 tid=0x0000000003ef5000 nid=0x5cd0 waiting on condition [0x00007fcc073f4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000a30c67f0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.apache.ambari.server.state.svccomphost.ServiceComponentHostImpl.convertToResponse(ServiceComponentHostImpl.java:1355) at org.apache.ambari.server.controller.AmbariManagementControllerImpl.getHostComponents(AmbariManagementControllerImpl.java:1299) at org.apache.ambari.server.controller.AmbariManagementControllerImpl.getHostComponents(AmbariManagementControllerImpl.java:3365) at org.apache.ambari.server.controller.internal.HostComponentResourceProvider$2.invoke(HostComponentResourceProvider.java:215) at org.apache.ambari.server.controller.internal.HostComponentResourceProvider$2.invoke(HostComponentResourceProvider.java:212) at org.apache.ambari.server.controller.internal.AbstractResourceProvider.getResources(AbstractResourceProvider.java:307) at org.apache.ambari.server.controller.internal.HostComponentResourceProvider.findResources(HostComponentResourceProvider.java:212) at org.apache.ambari.server.controller.internal.HostComponentResourceProvider.getResources(HostComponentResourceProvider.java:188) at org.apache.ambari.server.controller.internal.ClusterControllerImpl$ExtendedResourceProviderWrapper.queryForResources(ClusterControllerImpl.java:966) at 
org.apache.ambari.server.controller.internal.ClusterControllerImpl.getResources(ClusterControllerImpl.java:141)
Finally, a writer thread tries to start the upgrade and has to wait to get a write lock on the cluster:
"ambari-client-thread-507" #507 prio=5 os_prio=0 tid=0x0000000001db7000 nid=0x1e1a waiting on condition [0x00007fcc0d6cf000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000000a2564b80> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943) at org.apache.ambari.server.state.cluster.ClusterImpl.setUpgradeEntity(ClusterImpl.java:3641) at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$770b7b6b.CGLIB$setUpgradeEntity$35(<generated>) at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$770b7b6b$$FastClassByGuice$$ef7ae837.invoke(<generated>) at com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at org.apache.ambari.server.orm.AmbariJpaLocalTxnInterceptor.invoke(AmbariJpaLocalTxnInterceptor.java:118) at com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) at com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) at org.apache.ambari.server.state.cluster.ClusterImpl$$EnhancerByGuice$$770b7b6b.setUpgradeEntity(<generated>) at org.apache.ambari.server.controller.internal.UpgradeResourceProvider.createUpgrade(UpgradeResourceProvider.java:987) at 
org.apache.ambari.server.controller.internal.UpgradeResourceProvider$$EnhancerByGuice$$be5dc6f2.CGLIB$createUpgrade$4(<generated>) at org.apache.ambari.server.controller.internal.UpgradeResourceProvider$$EnhancerByGuice$$be5dc6f2$$FastClassByGuice$$83a8554c.invoke(<generated>)
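The pile-up in the traces above can be reproduced in miniature: once one thread holds a ReentrantReadWriteLock's write lock across a call that never returns, every thread that needs even a read lock on the same object stalls behind it. This is a minimal sketch (the lock and thread names are stand-ins for the ones in the report, not Ambari code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockPileUpDemo {
    // Stand-in for the per-component lock guarding ServiceComponentHostImpl state.
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();   // like setState taking the write lock
            try {
                Thread.sleep(5_000);   // simulates the JDBC socket read that never returns
            } catch (InterruptedException ignored) {
                // interrupted by main() once the demo is done
            } finally {
                lock.writeLock().unlock();
            }
        }, "heartbeat-processor");
        writer.start();
        Thread.sleep(100); // let the writer acquire the lock first

        // A "client" thread needing only a read lock still cannot get it.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("reader acquired=" + acquired); // prints "reader acquired=false"

        writer.interrupt();
        writer.join();
    }
}
```

In the real system the writer never wakes up at all, so the readers (and the queued cluster write lock behind them) wait forever rather than for a bounded sleep.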
So, the problem is clearly that ambari-heartbeat-processor-0 is stuck reading.
Investigating on Postgres shows a lock held by another session blocking the waiting thread:
| blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_statement | current_statement_in_blocking_process |
|---|---|---|---|---|---|
| 31705 | ambaricustomuser | 31660 | ambaricustomuser | UPDATE hostcomponentstate SET current_state = $1 WHERE (id = $2) | SELECT sequence_value FROM ambari_sequences WHERE sequence_name = $1 |
The connection that ran the sequence query is stuck in the "idle in transaction" state, which is why it still holds the locks that block the UPDATE. EclipseLink is not ending that transaction.
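The defensive shape that prevents an "idle in transaction" connection is to always end the transaction on every path, commit or not. A minimal sketch of the pattern, using a tiny stand-in interface rather than JPA's real EntityTransaction (the names here are illustrative, not Ambari's code):

```java
public class TxnCleanupSketch {
    // Minimal stand-in for javax.persistence.EntityTransaction (illustration only).
    interface Tx {
        void begin();
        void commit();
        void rollback();
        boolean isActive();
    }

    // A transaction is always ended here, so the connection can never be left
    // "idle in transaction" while holding row locks.
    static void runInTxn(Tx tx, Runnable work) {
        tx.begin();
        try {
            work.run();
            tx.commit();
        } finally {
            if (tx.isActive()) {
                tx.rollback(); // commit failed or was never reached
            }
        }
    }

    public static void main(String[] args) {
        // Tiny fake transaction that records whether it was ended.
        class FakeTx implements Tx {
            boolean active;
            boolean rolledBack;
            public void begin() { active = true; }
            public void commit() { throw new RuntimeException("simulated commit failure"); }
            public void rollback() { active = false; rolledBack = true; }
            public boolean isActive() { return active; }
        }
        FakeTx tx = new FakeTx();
        try {
            runInTxn(tx, () -> { /* the sequence SELECT would happen here */ });
        } catch (RuntimeException expected) {
            // the failure still propagates, but the transaction was ended
        }
        System.out.println("rolledBack=" + tx.rolledBack); // prints "rolledBack=true"
    }
}
```

Whether the actual fix lands in Ambari's interceptor or in EclipseLink, the invariant is the same: no code path may return with the transaction still open.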
Attachments
Issue Links