Details
Type: Bug
Status: Open
Priority: Normal
Resolution: Unresolved
Description
I have a Docker Swarm cluster with 3 distinct Cassandra services (named cassandra7, cassandra8, cassandra9) running on 3 different servers. All 3 services run version 3.11.16, using the official cassandra:3.11.16 image from Docker Hub. The first service is configured with just the following environment variables:
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7" CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9"
which the image entrypoint uses at startup to modify cassandra.yaml. For instance, the cassandra.yaml of the first service contains the following (the rest is the image default):
# grep tasks /etc/cassandra/cassandra.yaml
- seeds: "tasks.cassandra7,tasks.cassandra9"
listen_address: tasks.cassandra7
broadcast_address: tasks.cassandra7
broadcast_rpc_address: tasks.cassandra7
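For context, each service is created with something along these lines (the network name and placement constraint here are illustrative assumptions, not my exact stack definition):
docker service create \
  --name cassandra7 \
  --network cassandra-net \
  --constraint 'node.hostname == server7' \
  -e CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7" \
  -e CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" \
  cassandra:3.11.16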
Other services (8 and 9) have a similar configuration, obviously with a different CASSANDRA_LISTEN_ADDRESS (tasks.cassandra8 and tasks.cassandra9, respectively).
The cluster is running smoothly and all nodes are perfectly able to rejoin the cluster whatever happens, thanks to the Docker Swarm tasks.cassandraXXX "hostname": I can kill a Docker container and wait for Docker Swarm to restart it, force-update a service to trigger a restart, scale a service to 0 and then back to 1, restart an entire server, or turn off and then turn on all 3 servers. I have never found an issue with any of this.
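To give an idea, the kind of events I test are roughly the following (a sketch; the exact commands I run may vary slightly):
# Kill a container and let Docker Swarm reschedule it
docker rm -f $(docker ps -q -f name=cassandra7)
# Force a restart of the service without changing its definition
docker service update --force cassandra7
# Scale the service down to 0 and back to 1
docker service scale cassandra7=0
docker service scale cassandra7=1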
I also just completed a full upgrade of the cluster from version 2.2.8 to 3.11.16 (simply upgrading the Docker image used by the services) without issues. Thanks to a 2.2.8 snapshot on each server, I was also able to perform a full downgrade to 2.2.8 and back up to 3.11.16 again. I finally issued nodetool upgradesstables on all nodes, so my SSTables now have the me-* prefix.
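In terms of commands, that round trip was roughly the following (the snapshot tag is an assumption; cassandra7 shown as the example service):
# Take a snapshot on each node before touching anything
nodetool snapshot -t pre-upgrade
# Upgrade the image used by the service
docker service update --image cassandra:3.11.16 cassandra7
# Once every node runs 3.11.16, rewrite the SSTables in the new format
nodetool upgradesstables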
The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The procedure that I follow is very simple (the equivalent commands are sketched right after this list):
1. I start from the cassandra7 service (which is a seed node)
2. nodetool drain
3. Wait for the DRAINING ... DRAINED messages to appear in the log
4. Upgrade the Docker image of cassandra7 to the official 4.1.3 version
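In shell terms, steps 2-4 boil down to something like this (the container name is a placeholder):
# Step 2: flush and stop accepting writes on the node
docker exec -it <cassandra7-container> nodetool drain
# Step 3: wait for the DRAINED message
docker service logs --follow cassandra7 | grep -m1 DRAINED
# Step 4: switch the service to the 4.1.3 image
docker service update --image cassandra:4.1.3 cassandra7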
The procedure is exactly the same one I followed for the 2.2.8 --> 3.11.16 upgrade, obviously with a different version at step 4. Unfortunately the 3.x --> 4.x upgrade is not working: the cassandra7 service restarts and attempts to communicate with the other seed node (cassandra9), but the log of cassandra7 shows the following:
INFO [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 OutboundConnectionInitiator.java:390 - Failed to connect to peer tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000) io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
The relevant part of the log, related to the failing internode communication, is attached as cassandra7.log.
In the log of cassandra9 there is nothing after the above-mentioned step 4, so only cassandra7 is saying anything in the logs.
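Since cassandra9 logs nothing, I suppose I could capture the internode traffic on the node hosting cassandra9 to see which side actually resets the connection. A sketch (the interface name is an assumption; on an overlay network it may differ):
# Watch the internode port for connection attempts and resets
tcpdump -i eth0 -nn 'tcp port 7000'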
I tried multiple versions (4.0.11 but also 4.0.0) and the outcome is always the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot and finally perform step 4 using the official 3.11.16 image, node 7 restarts correctly and joins the cluster. I attached the relevant part of the log (see cassandra7.downgrade.log) where you can see that nodes 7 and 9 can communicate.
I suspect this could be related to port 7000 now (with Cassandra 4.x) supporting both encrypted and unencrypted traffic. As stated previously, I'm using the untouched official Cassandra images, so my cluster, inside the Docker Swarm, is not (and has never been) configured with encryption.
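To test that theory I could probe port 7000 of the upgraded node from one of the 3.11.16 nodes (assuming nc and openssl are available there; tasks.cassandra7 resolves inside the overlay network):
# Does a plain TCP connection open at all?
nc -vz tasks.cassandra7 7000
# Does the node expect a TLS handshake on the same port?
openssl s_client -connect tasks.cassandra7:7000 </dev/null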
I can also add the following: if I perform the 4 steps above for the cassandra9 and cassandra8 services as well, the cluster does work in the end. But this is not acceptable, because the cluster is unavailable until I finish the full upgrade of all nodes: I need to perform a rolling upgrade, one node after the other, where only 1 node is temporarily down and the other N-1 stay up.
Any idea on how to further investigate the issue? Thanks