- We run an entirely counter-based workload on the clusters where we've been seeing this issue (although it might also happen for non-counter clusters).
- We originally had a 48 node Cassandra 3.7 cluster (averaging ~500GB of load on each node) running on c4.4xlarge instances in EC2.
- We've since split this cluster into two Cassandra 3.11.0 clusters (for other performance reasons): one with 1/3 of our dataset on 48 c4.4xlarge instances, and the other with 2/3 of our dataset on a 96 node c4.4xlarge cluster.
- All of these clusters run on Ubuntu 14.04 with enhanced networking enabled.
Availability - Unavailable
Hello. For the last two months we've been fighting performance issues related to node restarts and hint playback, and last night, while debugging one of these restarts, we were able to gather solid evidence for the underlying issue.
The main issue we've been fighting on these clusters is very slow and unstable node restarts, which seem to be due to hint playback, with logs indicating "Finished hinted handoff of file ... hints to endpoint: 1-2-3-4, partially". We've posted about this on the mailing list as well, and this bug seems to be the cause of that thread: https://lists.apache.org/thread.html/41e9cb7b20696dd177fe925aba30d372614fcfe11d98275c7d9579cc@%3Cuser.cassandra.apache.org%3E
Our process for restarting nodes is to run nodetool drain and then restart the service via init. When originally watching the logs on the restarting node, we saw the standard Cassandra startup process of initializing keyspaces, loading sstables, and finally handshaking with the cluster, which takes about 5 minutes. After this, the logs are very quiet for the next 15 minutes.
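For reference, our restart procedure amounts to the following minimal shell sketch (the service name "cassandra" and the init invocation are assumptions about our Ubuntu 14.04 setup; adjust for your environment):

```shell
# Sketch of our node restart procedure. Assumes nodetool is on PATH and
# Cassandra is managed by init under the service name "cassandra".
restart_cassandra_node() {
    # Flush memtables, stop accepting traffic, and announce the shutdown
    # to peers via gossip.
    nodetool drain
    # Restart the service via init.
    sudo service cassandra restart
}

# Guard: only attempt this on an actual Cassandra node.
if command -v nodetool >/dev/null 2>&1; then
    restart_cassandra_node
else
    echo "nodetool not found; run this on a Cassandra node"
fi
```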
During this 15 minute period, some peers (anywhere from none to half of the cluster) are reported as DN by nodetool status.
When checking one of the nodes that is reported as DN, we see hint playback logging lines such as:
We traced the codepath from this log line, starting in HintsDispatcher. It appears that the callback created for hint sending is timing out; we verified this via the JMX metrics for the HintsService, where HintsTimedOut was incrementing during this period. We suspected this might be due to the restarting node being overloaded, as we do see heavy IO on restart; however, that IO pattern does not last for the complete duration of the "partial" hint playback period. We also generated a flame graph (using the jvm-tools stcap and ssa commands) of the restarting node during this time period, attached as 3.11_node_restart_without_disablegossip.svg. This seemed to confirm that the node isn't doing much during the idle period, as its threads are simply parked.
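The flame graphs were produced roughly as sketched below; the jar name, PID handling, and output file are assumptions, and exact flag names may differ between jvm-tools (sjk) versions, so consult your version's help output:

```shell
# Hypothetical sketch of capturing stack samples from the restarting node
# with jvm-tools (sjk) and rendering them as a flame graph.
capture_restart_flamegraph() {
    pid="$1"  # PID of the Cassandra JVM (assumed passed in by the caller)
    # stcap: record thread stack samples from the process to a dump file.
    java -jar sjk.jar stcap -p "$pid" -o restart.std
    # ssa: analyze the captured dump; --flame renders a flame graph.
    java -jar sjk.jar ssa --flame -f restart.std > restart_flame.html
}
```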
What is interesting is checking the debug logs on one of the nodes that is marked as DN and is partially playing back hints. While the log lines for hint playback are not helpful, what led us to this apparent bug was the debug logging for OutboundTcpConnection. Below is a pruned snippet of these logs:
The timeline of events here is:
The unpruned debug log snippet for the restarting node during the reconnection period is:
And on the hint-playing host:
Lastly, we generated a flame graph with the disablegossip step included, attached as 3.11_node_restart_with_disablegossip.svg.
Based on this, it appears that nodetool drain correctly gossips to the node's peers that it is shutting down, and the peers tear down their gossip and small-message connections. The bug appears to be that peers almost instantly reconnect the small-message connection, which then lingers in a dead state until a 20 minute timeout kills it. The OutboundTcpConnection class does not seem to detect this condition, and OutboundTcpConnectionPool reuses the dead connection for hint playback, which causes the "partially" log lines and the timeout metrics on the hints service.
We have worked around this issue by issuing nodetool disablegossip after nodetool drain, which causes the lingering small-message connection to be killed correctly, allowing it to reconnect, and hints to play back, once the node restarts. We have not seen the issue on the 8 node restarts we've done since this process change.
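The amended procedure, sketched under the same assumptions as before (nodetool on PATH, init service named "cassandra"):

```shell
# Sketch of the amended restart procedure with the disablegossip workaround.
safe_restart_cassandra_node() {
    nodetool drain
    # Workaround: explicitly stop gossip so that peers tear down, rather than
    # immediately reconnect, their small-message connections to this node.
    nodetool disablegossip
    sudo service cassandra restart
}

# Guard: only attempt this on an actual Cassandra node.
if command -v nodetool >/dev/null 2>&1; then
    safe_restart_cassandra_node
else
    echo "nodetool not found; run this on a Cassandra node"
fi
```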
I have also not been able to recreate this issue on a smaller non-production cluster or locally via CCM. It appears to be a twofold issue:
- On the peer nodes, there seems to be some kind of race condition between the handling of the gossip message indicating that the restarting node is shutting down (resulting from nodetool drain) and whichever thread is using the small-message TCP connection.
- On the restarting node, a new small-message connection is accepted while nodetool drain is still in progress.
If there is anything else I should provide here, please let me know.
We've been discussing this with rustyrazorblade as well.