Description
When the primary transaction service loses leadership, a call is made to stop the Thrift server. Under heavy connection load, the Thrift server can hang during stop, which prevents the leader from passing leadership on to another transaction service process. As a result, the transaction service becomes unresponsive to clients.
Here is the sequence of events that can lead to this:
- Due to the large number of connections, the AcceptThread of TThreadedSelectorServer blocks while trying to add a new connection to the accepted queue of a SelectorThread.
- The SelectorThreads are waiting for some transaction operation to complete.
- If the service loses leadership at this point, a call is made to stop the Thrift server.
- The TThreadedSelectorServer.stop() method sets the stop flag to true and wakes up the selectors of the AcceptThread and the SelectorThreads.
- On wakeup, each SelectorThread sees that the stop flag is true and exits without removing any more elements from its accepted queue.
- The AcceptThread continues to block on the accepted queue, which prevents the shutdown sequence of ThriftRPCServer from proceeding. Leadership therefore remains with the current service, which has only partially shut down, and the transaction service becomes unresponsive.
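The sequence above can be reproduced in miniature without Thrift. The following is only a sketch of the blocking pattern, not the Thrift code itself: the class name AcceptQueueHangDemo, the method name, the queue capacity of 1, and the thread names are all hypothetical stand-ins. A bounded queue is filled, a "selector" thread exits on stop without draining it, and an "accept" thread stays blocked in put(). The demo uses timed joins and a final interrupt so that it terminates, whereas the real shutdown path shown in the stack trace uses an untimed join.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;

public class AcceptQueueHangDemo {

    // Returns true if the "accept" thread is still blocked after the
    // "selector" thread has seen the stop signal and exited.
    static boolean acceptThreadStuckAfterStop() throws InterruptedException {
        // Hypothetical capacity; one pending connection already fills the queue.
        final BlockingQueue<String> acceptedQueue = new ArrayBlockingQueue<>(1);
        final CountDownLatch stop = new CountDownLatch(1);
        acceptedQueue.put("conn-0");

        // Stand-in for a SelectorThread: busy until stop is signalled,
        // then exits without taking anything from its accepted queue.
        Thread selector = new Thread(() -> {
            try {
                stop.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "selector");

        // Stand-in for the AcceptThread: blocks in put() on the full queue.
        Thread accept = new Thread(() -> {
            try {
                acceptedQueue.put("conn-1");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "accept");

        selector.setDaemon(true);
        accept.setDaemon(true);
        selector.start();
        accept.start();

        // Give the accept thread time to park inside put() (bounded wait).
        for (int i = 0; i < 500 && accept.getState() != Thread.State.WAITING; i++) {
            Thread.sleep(10);
        }

        // Equivalent of stop(): signal the selector, then try to join both threads.
        stop.countDown();
        selector.join(1000);
        accept.join(500); // timed join; an untimed join would hang here

        boolean stuck = accept.isAlive() && !selector.isAlive();

        // Clean up so the demo itself terminates.
        accept.interrupt();
        accept.join(1000);
        return stuck;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("accept thread stuck after stop: " + acceptThreadStuckAfterStop());
    }
}
```

In this sketch the blocked producer is released only by an interrupt, because nothing consumes from the queue once the consumer has exited.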
Stack trace when the Thrift server hangs:
"ThriftRPCServer" daemon prio=5 tid=0x00007fbb39157000 nid=0x6503 in Object.wait() [0x0000000115767000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000007af5ae380> (a org.apache.thrift.server.TThreadedSelectorServer$AcceptThread)
    at java.lang.Thread.join(Thread.java:1281)
    - locked <0x00000007af5ae380> (a org.apache.thrift.server.TThreadedSelectorServer$AcceptThread)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.thrift.server.TThreadedSelectorServer.joinThreads(TThreadedSelectorServer.java:251)
    at org.apache.thrift.server.TThreadedSelectorServer.waitForShutdown(TThreadedSelectorServer.java:241)
    at org.apache.thrift.server.AbstractNonblockingServer.serve(AbstractNonblockingServer.java:94)
    at co.cask.tephra.rpc.ThriftRPCServer.run(ThriftRPCServer.java:210)
    at com.google.common.util.concurrent.AbstractExecutionThreadService$1$1.run(AbstractExecutionThreadService.java:52)
    at java.lang.Thread.run(Thread.java:745)

"Thread-5" daemon prio=5 tid=0x00007fbb3a81e000 nid=0x7303 waiting on condition [0x0000000115e7c000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000007af5ae3f8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
    at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:324)
    at org.apache.thrift.server.TThreadedSelectorServer$SelectorThread.addAcceptedConnection(TThreadedSelectorServer.java:520)
    at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.doAddAccept(TThreadedSelectorServer.java:462)
    at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.handleAccept(TThreadedSelectorServer.java:433)
    at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.select(TThreadedSelectorServer.java:413)
    at org.apache.thrift.server.TThreadedSelectorServer$AcceptThread.run(TThreadedSelectorServer.java:375)