CASSANDRA-18075

Upgraded (C* 4.0.4) node stops communicating with older version (3.11.4) nodes during upgrade


Details

    • Bug
    • Status: Triage Needed
    • Normal
    • Resolution: Unresolved
    • None
    • Feature/Encryption
    • None
    • All
    • None

    Description

      We are testing an upgrade from Cassandra 3.11.4 to 4.0.4 on our SSL-enabled test cluster and have run into an issue.

      Our cluster size is 3x3 (two datacenters with three nodes each). State before the upgrade:

      Datacenter: abssl_dev_tap_ttc
      =============================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
      UN  10.109.6.153   94.27 KiB  16           100.0%            130e59d2-2a9a-4039-a42f-deb20afcf288  rack1
      UN  10.109.45.8    104.43 KiB  16           100.0%            35274a2c-f915-4308-9981-d207a4e2108f  rack1
      UN  10.109.66.149  104.23 KiB  16           100.0%            ea0151bc-fb6c-425d-af42-75c10e52f941  rack1
      Datacenter: abssl_dev_tap_tte
      =============================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
      UN  10.110.4.110   104.44 KiB  16           100.0%            fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554  rack1
      UN  10.110.44.220  99.33 KiB  16           100.0%            f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947  rack1
      UN  10.110.49.242  65.57 KiB  16           100.0%            72bc4ae5-876d-4d0a-91ac-6cf8b531b4dd  rack1
      
      dbaasprod-ca-abssl-de-393671-v001-yqlvf:~# nodetool describecluster
      Cluster Information:
      	Name: abssl_dev
      	Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
      	DynamicEndPointSnitch: enabled
      	Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
      	Schema versions:
      		f68fbc0c-c9d6-3709-8075-c5a0d74192f2: [10.110.4.110, 10.110.44.220, 10.109.6.153, 10.109.45.8, 10.109.66.149, 10.110.49.242]
      
      

      During the upgrade, we re-run our pipeline, which provisions a new server (with a different IP) carrying the Cassandra 4.0.4 binary.
      The '/data' disk (data files, commit logs, etc.) is detached from the old server and attached to the new server.

      This process works fine on a non-SSL cluster, but when we perform it on an SSL cluster, the new node stops communicating with the rest of the nodes.

      In this example, after the upgrade, node 10.110.4.110 was replaced by a new server with the new IP 10.110.44.207.

      Output from a 3.11.4 node (10.109.6.153):

      dbaasprod-ca-abssl-dc-437097-v001-7mump:~# hostname -i
      10.109.6.153
      dbaasprod-ca-abssl-dc-437097-v001-7mump:~# java -version
      openjdk version "1.8.0_322"
      OpenJDK Runtime Environment (Temurin)(build 1.8.0_322-b06)
      OpenJDK 64-Bit Server VM (Temurin)(build 25.322-b06, mixed mode)
      dbaasprod-ca-abssl-dc-437097-v001-7mump:~# nodetool status
      Datacenter: abssl_dev_tap_ttc
      =============================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
      UN  10.109.6.153   135.24 KiB  16           100.0%            130e59d2-2a9a-4039-a42f-deb20afcf288  rack1
      UN  10.109.45.8    135.35 KiB  16           100.0%            35274a2c-f915-4308-9981-d207a4e2108f  rack1
      UN  10.109.66.149  135.25 KiB  16           100.0%            ea0151bc-fb6c-425d-af42-75c10e52f941  rack1
      Datacenter: abssl_dev_tap_tte
      =============================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
      DN  10.110.4.110   104.44 KiB  16           100.0%            fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554  rack1
      UN  10.110.44.220  104.44 KiB  16           100.0%            f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947  rack1
      UN  10.110.49.242  65.57 KiB  16           100.0%            72bc4ae5-876d-4d0a-91ac-6cf8b531b4dd  rack1
      
      dbaasprod-ca-abssl-dc-437097-v001-7mump:~# nodetool describecluster
      Cluster Information:
      	Name: abssl_dev
      	Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
      	DynamicEndPointSnitch: enabled
      	Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
      	Schema versions:
      		f68fbc0c-c9d6-3709-8075-c5a0d74192f2: [10.110.44.220, 10.109.6.153, 10.109.45.8, 10.109.66.149, 10.110.49.242]
      
      		UNREACHABLE: [10.110.4.110]
      

      Output from the upgraded 4.0.4 node (10.110.44.207):

      dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# hostname -i
      10.110.44.207
      dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# java -version
      openjdk version "11.0.15" 2022-04-19
      OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
      OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)
      dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# nodetool status
      Datacenter: DC1
      ===============
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
      DN  10.109.6.153   ?           16      0.0%              130e59d2-2a9a-4039-a42f-deb20afcf288  r1
      DN  10.109.45.8    ?           16      0.0%              35274a2c-f915-4308-9981-d207a4e2108f  r1
      DN  10.109.66.149  ?           16      0.0%              ea0151bc-fb6c-425d-af42-75c10e52f941  r1
      DN  10.110.44.220  ?           16      0.0%              f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947  r1
      DN  10.110.49.242  ?           16      0.0%              72bc4ae5-876d-4d0a-91ac-6cf8b531b4dd  r1
      
      Datacenter: abssl_dev_tap_tte
      =============================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
      UN  10.110.44.207  146.27 KiB  16      100.0%            fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554  rack1
      
      dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# nodetool describecluster
      Cluster Information:
      	Name: abssl_dev
      	Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
      	DynamicEndPointSnitch: disabled
      	Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
      	Schema versions:
      		1ccaeb62-5816-3599-897f-de59fd56eef2: [10.110.44.207]
      
      		UNREACHABLE: [10.109.45.8, 10.109.66.149, 10.110.44.220, 10.109.6.153, 10.110.49.242]
      
      Stats for all nodes:
      	Live: 1
      	Joining: 0
      	Moving: 0
      	Leaving: 0
      	Unreachable: 5
      
      Data Centers:
      	DC1 #Nodes: 5 #Down: 0
      	abssl_dev_tap_tte #Nodes: 1 #Down: 0
      
      Database versions:
      	: [10.109.45.8:7000, 10.109.66.149:7000, 10.110.44.220:7000, 10.109.6.153:7000, 10.110.49.242:7000]
      
      	4.0.4: [10.110.44.207:7000]
      
      Keyspaces:
      	system_schema -> Replication class: LocalStrategy {}
      	system -> Replication class: LocalStrategy {}
      	system_auth -> Replication class: NetworkTopologyStrategy {abssl_dev_tap_tte=3, abssl_dev_tap_ttc=3}
      	system_distributed -> Replication class: NetworkTopologyStrategy {abssl_dev_tap_tte=3, abssl_dev_tap_ttc=3}
      	system_traces -> Replication class: NetworkTopologyStrategy {abssl_dev_tap_tte=3, abssl_dev_tap_ttc=3}
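
      The two views above disagree: the 3.11.4 node sees only the replaced address as down, while the 4.0.4 node sees every peer as down and places them in DC1/r1. The following is a minimal sketch, not part of the original report, for comparing such outputs mechanically. It assumes the `nodetool status` text has been saved to a string; the names `parse_status`, `down_nodes`, and `SAMPLE` are hypothetical, and `SAMPLE` is a trimmed copy of the 4.0.4 node's output.

```python
import re
from collections import defaultdict

# Matches a node row such as:
#   "DN  10.109.6.153   ?   16   0.0%   130e59d2-...   r1"
# First letter is Up/Down, second is Normal/Leaving/Joining/Moving.
NODE_ROW = re.compile(r"^\s*([UD][NLJM])\s+(\d{1,3}(?:\.\d{1,3}){3})\s")

# Trimmed excerpt of the output from the 4.0.4 node in this report.
SAMPLE = """\
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
DN  10.109.6.153   ?           16      0.0%              130e59d2-2a9a-4039-a42f-deb20afcf288  r1
DN  10.110.44.220  ?           16      0.0%              f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947  r1

Datacenter: abssl_dev_tap_tte
=============================
UN  10.110.44.207  146.27 KiB  16      100.0%            fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554  rack1
"""


def parse_status(text):
    """Return {datacenter: {address: state}} parsed from `nodetool status` text."""
    view = defaultdict(dict)
    dc = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Datacenter:"):
            dc = stripped.split(":", 1)[1].strip()
            continue
        match = NODE_ROW.match(line)
        if match and dc is not None:
            state, address = match.groups()
            view[dc][address] = state
    return dict(view)


def down_nodes(view):
    """Return the sorted addresses whose state starts with 'D' (down)."""
    return sorted(
        address
        for nodes in view.values()
        for address, state in nodes.items()
        if state.startswith("D")
    )


if __name__ == "__main__":
    view = parse_status(SAMPLE)
    for dc, nodes in view.items():
        print(dc, nodes)
    print("down:", down_nodes(view))
```

      Running the same parser over the output of each node and comparing the results makes it easy to spot which peers a given node considers down and in which datacenter it has placed them.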
      
      

      The system.log file on the new node 10.110.44.207 (Cassandra 4.0.4) shows the following errors:

      WARN  [Messaging-EventLoop-3-6] 2022-11-28 06:20:49,577 NoSpamLogger.java:95 - /10.110.44.207:7000->/10.109.45.8:7000-URGENT_MESSAGES-[no-channel] dropping message of type GOSSIP_DIGEST_SYN whose timeout expired before reaching the network
      INFO  [Messaging-EventLoop-3-6] 2022-11-28 06:21:17,921 NoSpamLogger.java:92 - /10.110.44.207:7000->/10.110.49.242:7000-URGENT_MESSAGES-[no-channel] failed to connect
      io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.110.49.242:7000
      Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
      	at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
      	at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
      	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
      	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
      	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
      	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
      	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
      	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
      	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
      	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      	at java.base/java.lang.Thread.run(Thread.java:829)
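
      The log lines above show the 4.0.4 node being refused on port 7000 of its peers. The following is a minimal diagnostic sketch, not part of the original report, that extracts the affected peers from such log text and offers a plain TCP probe. The names `failing_peers`, `port_open`, and `LOG_SAMPLE` are hypothetical. Probing port 7001 as well rests on an assumption: 7001 is Cassandra's default `ssl_storage_port`, and whether the 3.11.4 nodes in this cluster actually listen there must be checked against the attached cassandra.yaml files.

```python
import re
import socket

# Matches the connection descriptor used in the 4.0 messaging log lines, e.g.
#   /10.110.44.207:7000->/10.109.45.8:7000-URGENT_MESSAGES-[no-channel] failed to connect
LINK = re.compile(
    r"/(?P<src>[\d.]+):(?P<src_port>\d+)->/(?P<dst>[\d.]+):(?P<dst_port>\d+)"
    r"-(?P<channel>[A-Z_]+)-\[[^\]]*\]\s+(?P<event>.+)"
)

# Two lines copied from the system.log excerpt in this report.
LOG_SAMPLE = (
    "WARN  [Messaging-EventLoop-3-6] 2022-11-28 06:20:49,577 NoSpamLogger.java:95 - "
    "/10.110.44.207:7000->/10.109.45.8:7000-URGENT_MESSAGES-[no-channel] "
    "dropping message of type GOSSIP_DIGEST_SYN whose timeout expired before reaching the network\n"
    "INFO  [Messaging-EventLoop-3-6] 2022-11-28 06:21:17,921 NoSpamLogger.java:92 - "
    "/10.110.44.207:7000->/10.110.49.242:7000-URGENT_MESSAGES-[no-channel] failed to connect\n"
)


def failing_peers(log_text):
    """Return {(peer_ip, peer_port): [event, ...]} for outbound-connection log lines."""
    peers = {}
    for line in log_text.splitlines():
        match = LINK.search(line)
        if match:
            key = (match.group("dst"), int(match.group("dst_port")))
            peers.setdefault(key, []).append(match.group("event").strip())
    return peers


def port_open(host, port, timeout=2.0):
    """Plain TCP connect test. It does not perform a TLS handshake."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for (peer, port), events in failing_peers(LOG_SAMPLE).items():
        print(peer, port, events)
        # Assumption: 7001 is only the default ssl_storage_port; adjust to the real config.
        for candidate in (port, 7001):
            print("  port", candidate, "open:", port_open(peer, candidate))
```

      If a peer refuses 7000 but accepts 7001, that would suggest the old and new nodes disagree about which port carries encrypted internode traffic. This only narrows down the cause and does not confirm it.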
      

      I am attaching the cassandra.yaml and cassandra-env.sh files from both versions (3.11.4 and 4.0.4),
      as well as the system.log file from the upgraded node 10.110.44.207.

      This looks like a bug, so I am raising this Jira. Could you please have a look?

      Let me know if you need any more details.

      Thanks,
      Alaykumar Barochia

      Attachments

        1. cassandra.yaml_10.110.44.207_explicitely_set_port
          2 kB
          Alaykumar Barochia
        2. cassandra.yaml_10.110.49.242_explicitely_set_port
          2 kB
          Alaykumar Barochia
        3. cassandra.yaml_3114
          2 kB
          Alaykumar Barochia
        4. cassandra.yaml_404
          2 kB
          Alaykumar Barochia
        5. cassandra-env.sh_3114
          3 kB
          Alaykumar Barochia
        6. cassandra-env.sh_404
          3 kB
          Alaykumar Barochia
        7. In-place-upgrade.zip
          25 kB
          Alaykumar Barochia
        8. system.log_10.110.44.207
          92 kB
          Alaykumar Barochia
        9. system.log_10.110.44.207_after_explicitely_set_port
          67 kB
          Alaykumar Barochia
        10. system.log_10.110.49.242_after_explicitely_set_port
          30 kB
          Alaykumar Barochia

          People

            Assignee: Unassigned
            Reporter: Alaykumar Barochia (abarochia)