Details
Type: Bug
Status: Open
Priority: Urgent
Resolution: Unresolved
Environment
- We run an entirely counter-based workload on the clusters where we've been seeing this issue (although it might happen for non-counter clusters as well).
- We originally had a 48-node Cassandra 3.7 cluster (averaging ~500GB of load on each node) running on c4.4xlarge instances in EC2.
- We've since split this cluster into two Cassandra 3.11.0 clusters (for other performance reasons): one with 1/3 of our dataset on 48 c4.4xlarge instances, and the other with 2/3 of our dataset on a 96-node c4.4xlarge cluster.
- All of these clusters run on Ubuntu 14.04 with enhanced networking enabled.
Bug Category: Availability - Unavailable
Severity: Critical
Description
Hello. For the last two months we've been fighting performance issues relating to node restarts and hint playback, and last night, while debugging one of these restarts, we were able to gather a good bit of evidence for the underlying issue.
The main problem we've been fighting on these clusters is very slow and unstable node restarts, which seem to be due to hint playback, with logs indicating "Finished hinted handoff of file ... hints to endpoint: 1-2-3-4, partially". We've posted about this on the mailing list as well, and this bug seems to be the cause of that issue: https://lists.apache.org/thread.html/41e9cb7b20696dd177fe925aba30d372614fcfe11d98275c7d9579cc@%3Cuser.cassandra.apache.org%3E
Our process for restarting nodes is to run nodetool drain and then restart the service via init. When originally watching the logs on the restarting node, we saw the standard Cassandra startup process of initializing keyspaces, loading SSTables, and finally starting to handshake with the cluster, which takes about 5 minutes. After this, the logs are very quiet for roughly the next 15 minutes.
During this 15-minute period, some peers (anywhere between zero and half of the cluster) are reported in DN status by nodetool status.
When checking one of the nodes that is reported as DN, we see hint playback logging lines such as:
INFO [HintsDispatcher:12] 2017-10-31 01:06:18,876 HintsDispatchExecutor.java:289 - Finished hinted handoff of file 8724f664-dff1-4c20-887b-6a26ae54a9b5-1509410768866-1.hints to endpoint /20.0.131.175: 8724f664-dff1-4c20-887b-6a26ae54a9b5, partially
We traced the code path from this log line, starting in HintsDispatcher. It appears the callback that is created for the hint sending is timing out, which we verified via the JMX metrics for the HintsService:
org.apache.cassandra.metrics:name=HintsFailed,type=HintsService
org.apache.cassandra.metrics:name=HintsSucceeded,type=HintsService
org.apache.cassandra.metrics:name=HintsTimedOut,type=HintsService
Of these, HintsTimedOut was incrementing during this period. We suspected this might be due to the restarting node being overloaded, as we do see heavy IO on restart. However, that IO does not span the complete duration of the "partial" hint playback period. We also generated a flame graph (using the jvm-tools stcap and ssa) of the restarting node during this time period, attached as 3.11_node_restart_without_disablegossip.svg. This seemed to confirm that the node isn't doing much during the idle period, as the threads are simply parked.
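For anyone wanting to watch the same meters, a minimal sketch along these lines is enough; the JMX port (7199) and the "Count" attribute are assumptions based on the usual Cassandra metrics-over-JMX setup, so adjust for your configuration:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical watcher for the HintsService meters during a restart.
public class HintsMetricsWatcher
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            String[] meters = { "HintsSucceeded", "HintsFailed", "HintsTimedOut" };
            // Poll every 10 seconds for roughly 15 minutes.
            for (int i = 0; i < 90; i++)
            {
                for (String meter : meters)
                {
                    ObjectName name = new ObjectName("org.apache.cassandra.metrics:name=" + meter + ",type=HintsService");
                    System.out.println(meter + " = " + mbs.getAttribute(name, "Count"));
                }
                Thread.sleep(10000);
            }
        }
        finally
        {
            connector.close();
        }
    }
}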
What is interesting is checking the debug logs on one of the nodes that is marked as DN and is partially playing back hints. While the log lines for hint playback are not helpful, what led us to this apparent bug is the debug logging for OutboundTcpConnection. Below is a pruned snippet of these logs:
[GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,598 OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
[HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,601 OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,795 OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,797 OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:47,899 OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:48,002 OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
... repeats until the node restarts gossip transport ...
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,703 OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:49:04,806 OutboundTcpConnection.java:545 - Unable to connect to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,545 OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
[HANDSHAKE-/20.0.131.175] 2017-10-31 00:50:46,547 OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:50:46,600 OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,376 OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,377 OutboundTcpConnection.java:349 - Error writing to /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,382 OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
[HANDSHAKE-/20.0.131.175] 2017-10-31 01:06:25,383 OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
[MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 01:06:25,385 OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
The timeline of events here is:
- T 00:00 - the restarting node closed the small + gossip sockets to the hints host, and restarted
- T 00:00 - the hints host then reattempted a connection and handshaked the small connection instantly afterwards (node restarted)
- T 00:47 - T 4:46 - the hints host repeatedly attempts outbound gossip connections; throughout this range hints are still playing back "partially"
- T 21:27 - the hints host logs the previously (reattempted) small connection as errored (timeout)
- T 21:27 - the hints host retries the connection
- T 21:27 - hints work, things are done
A non-pruned debug log snippet for the restarting node in the time period of the reconnection is:
debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 StorageService.java:1439 - DRAINING: starting drain process
debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,537 HintsService.java:220 - Paused hints dispatch
debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,578 Server.java:176 - Stop listening for CQL clients
debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,581 Gossiper.java:1530 - Announcing shutdown
debug.log:DEBUG [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 StorageService.java:2252 - Node /20.0.131.175 state shutdown, token [ .... token range .... ]
debug.log:INFO [RMI TCP Connection(54158)-127.0.0.1] 2017-10-31 00:46:00,582 StorageService.java:2255 - Node /20.0.131.175 state jump to shutdown
and on the hint playing host:
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 FailureDetector.java:457 - Ignoring interval time of 2408381657 for /20.0.132.6
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:45:59,642 FailureDetector.java:457 - Ignoring interval time of 3000155164 for /20.0.131.252
debug.log.8.zip:INFO [GossipStage:1] 2017-10-31 00:46:00,586 Gossiper.java:1044 - InetAddress /20.0.131.175 is now DOWN
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:205 - Enqueuing socket close for /20.0.131.175
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Gossip] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,586 OutboundTcpConnection.java:410 - Socket to /20.0.131.175 closed
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,586 FailureDetector.java:323 - Forcing conviction of /20.0.131.175
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,598 OutboundTcpConnection.java:424 - Attempting to connect to /20.0.131.175
debug.log.8.zip:INFO [HANDSHAKE-/20.0.131.175] 2017-10-31 00:46:00,599 OutboundTcpConnection.java:560 - Handshaking version with /20.0.131.175
debug.log.8.zip:DEBUG [MessagingService-Outgoing-/20.0.131.175-Small] 2017-10-31 00:46:00,601 OutboundTcpConnection.java:532 - Done connecting to /20.0.131.175
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 FailureDetector.java:457 - Ignoring interval time of 2001490285 for /20.0.131.6
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 FailureDetector.java:457 - Ignoring interval time of 2001552636 for /20.0.132.8
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,642 FailureDetector.java:457 - Ignoring interval time of 2001576461 for /20.0.131.8
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 FailureDetector.java:457 - Ignoring interval time of 2001599351 for /20.0.130.18
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 FailureDetector.java:457 - Ignoring interval time of 2930107432 for /20.0.131.148
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 FailureDetector.java:457 - Ignoring interval time of 2001646958 for /20.0.130.154
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 FailureDetector.java:457 - Ignoring interval time of 2001665876 for /20.0.132.26
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 FailureDetector.java:457 - Ignoring interval time of 2001689764 for /20.0.130.31
debug.log.8.zip:DEBUG [GossipStage:1] 2017-10-31 00:46:00,643 FailureDetector.java:457 - Ignoring interval time of 2001624719 for /20.0.130.42
Lastly, we generated a flame graph with the disablegossip step included, attached as 3.11_node_restart_with_disablegossip.svg.
Based on this, it would appear that nodetool drain is correctly gossiping to its peers that the node is shutting down, and the peers tear down their own gossip and small connections. The bug appears to be that peers nearly instantly reconnect the small connection, which then lingers in a dead state until a roughly 20-minute timeout kills it. The OutboundTcpConnection class does not seem to detect this condition, and OutboundTcpConnectionPool reuses this dead connection for hint playback, which causes the "partially" log lines and the JMX timeout metrics for the hints service.
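To make the failure mode concrete, here is a deliberately simplified sketch (not the actual OutboundTcpConnection/OutboundTcpConnectionPool code) of why a cached-but-dead connection can go unnoticed until a write finally errors out:

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the pooling behaviour described above.
public class NaiveConnectionPool
{
    private final Map<InetSocketAddress, Socket> cached = new ConcurrentHashMap<>();

    public Socket getConnection(InetSocketAddress peer) throws IOException
    {
        Socket socket = cached.get(peer);
        if (socket != null && !socket.isClosed())
        {
            // isClosed() only reflects our side of the connection. A socket
            // re-opened during the peer's drain can look healthy here even
            // though the remote end is no longer servicing it.
            return socket;
        }
        Socket fresh = new Socket();
        fresh.connect(peer, 2000);
        cached.put(peer, fresh);
        return fresh;
    }

    public void send(InetSocketAddress peer, byte[] payload) throws IOException
    {
        OutputStream out = getConnection(peer).getOutputStream();
        // Writes to such a half-open connection land in the kernel send buffer
        // and only fail once TCP gives up retransmitting (on the order of
        // 15-20 minutes with default settings), which would explain the window
        // in which the hint callbacks simply time out.
        out.write(payload);
        out.flush();
    }
}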
We have worked around this issue by issuing nodetool disablegossip after nodetool drain to further kill any of these connections, and we have not seen the problem on the 8 node restarts we've done since that process change. With this step the small connection gets torn down properly, so it is re-established once the node restarts and hints play back as expected.
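For clarity, the amended restart procedure amounts to the following rough sketch; the RestartRunbook class is purely illustrative, and the service name and restart command are assumptions about a stock package install:

import java.util.Arrays;

// Illustrative runbook for the workaround described above.
public class RestartRunbook
{
    private static void run(String... command) throws Exception
    {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor() != 0)
            throw new IllegalStateException("command failed: " + Arrays.toString(command));
    }

    public static void main(String[] args) throws Exception
    {
        run("nodetool", "drain");
        // Extra step: per our observation, this also tears down the lingering
        // small-message sockets that peers re-opened during the drain.
        run("nodetool", "disablegossip");
        run("service", "cassandra", "restart");
    }
}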
I have also not been able to recreate this issue on a smaller non-production cluster or locally via CCM. It appears to be a two-fold issue:
- On the peer nodes, there seems to be some kind of race condition between the handling of the gossip message indicating the restarting node is shutting down (resulting from nodetool drain) and whichever thread is using the small TCP connection.
- On the restarting node, a new small connection is accepted while nodetool drain is still in progress.
If there is anything else I should provide here, please let me know.
We've been discussing this with Jon Haddad as well.