Details
-
Bug
-
Status: Open
-
Normal
-
Resolution: Unresolved
-
None
-
Availability
-
Normal
-
Normal
-
User Report
-
All
-
None
Description
Recently upgrade a 3.11.9 cluster to 4.0.4 , server are onprem, centos 7
All went well, the moment the last node came up, all hell broke loose.
Nodes starting seeing each other as down, gossip pending tasks was climbing in all nodes.
Eventually we discovered we had a lot of outdated seeds addresses in the yaml file (along with many good ones), after removing all the bad addresses, and many restarts, we were able to get all nodes up and running.
The bad seeds configuration was there all along in the 3.11 setup, but only once 4.0.4 was up on all nodes we had an issue.
INFO [Native-Transport-Requests-22] 2022-06-02 20:30:00,285 QueryProcessor.java:547 - Fully upgraded to at least 4.0.4
INFO [Messaging-EventLoop-3-18] 2022-06-02 20:30:00,314 OutboundConnection.java:1150 - /10.106.26.152:7000(/10.106.26.152:58984)->/10.110.48.16:7000-SMALL_MESSAGES-b7dbabb6 successfully connected, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-3] 2022-06-02 20:30:00,412 InboundConnectionInitiator.java:464 - /10.110.48.16:7000(/10.110.48.16:60706)->/10.106.26.152:7000-SMALL_MESSAGES-8023af58 messaging connection established, version = 12, framing = CRC, encryption = unencrypted
INFO [GossipStage:1] 2022-06-02 20:30:02,214 Gossiper.java:1312 - InetAddress /10.110.48.16:7000 is now UP
INFO [Messaging-EventLoop-3-15] 2022-06-02 20:30:02,987 OutboundConnection.java:1150 - /10.106.26.152:7000(/10.106.26.152:38532)->/10.11.40.208:7000-SMALL_MESSAGES-3a3e215a successfully connected, version = 12, framing = LZ4, encryption = unencrypted
INFO [Messaging-EventLoop-3-4] 2022-06-02 20:30:03,137 InboundConnectionInitiator.java:464 - /10.21.26.123:7000(/10.21.26.123:58102)->/10.106.26.152:7000-SMALL_MESSAGES-6e8f7fd3 messaging connection established, version = 12, framing = LZ4, encryption = unencrypted
INFO [Messaging-EventLoop-3-5] 2022-06-02 20:30:03,262 InboundConnectionInitiator.java:464 - /10.11.40.208:7000(/10.11.40.208:40720)->/10.106.26.152:7000-SMALL_MESSAGES-ebd509fe messaging connection established, version = 12, framing = LZ4, encryption = unencrypted
INFO [GossipStage:1] 2022-06-02 20:30:03,339 StorageService.java:2785 - Node /10.110.48.16:7000 state jump to NORMAL
INFO [Messaging-EventLoop-3-6] 2022-06-02 20:30:04,533 InboundConnectionInitiator.java:464 - /10.40.0.135:7000(/10.40.0.135:43604)->/10.106.26.152:7000-SMALL_MESSAGES-72c1cc35 messaging connection established, version = 12, framing = LZ4, encryption = unencrypted
INFO [Messaging-EventLoop-3-6] 2022-06-02 20:30:05,011 OutboundConnection.java:1150 - /10.106.26.152:7000(/10.106.26.152:40678)->/10.106.48.97:7000-SMALL_MESSAGES-ad50944c successfully connected, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-7] 2022-06-02 20:30:05,011 InboundConnectionInitiator.java:464 - /10.106.48.97:7000(/10.106.48.97:57368)->/10.106.26.152:7000-SMALL_MESSAGES-c98513ad messaging connection established, version = 12, framing = CRC, encryption = unencrypted
INFO [Messaging-EventLoop-3-8] 2022-06-02 20:30:14,783 NoSpamLogger.java:92 - /10.106.26.152:7000->/10.20.0.91:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /10.20.0.91:7000
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
WARN [GossipTasks:1] 2022-06-02 20:30:41,944 Gossiper.java:1036 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
INFO [Service Thread] 2022-06-02 20:30:43,977 GCInspector.java:299 - GPGC New GC in 687ms. GenPauseless New Gen: 46198001664 -> 1505914880; GenPauseless Old Gen: 4413739008 -> 4769091584;
INFO [ScheduledTasks:1] 2022-06-02 20:30:44,700 MessagingMetrics.java:206 - READ_REQ messages were dropped in last 5000 ms: 34 internal and 1 cross node. Mean internal dropped latency: 252 ms and Mean cross-node dropped latency: 268 ms
WARN [GossipTasks:1] 2022-06-02 20:31:17,011 Gossiper.java:1036 - Gossip stage has 20 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:31:18,112 Gossiper.java:1036 - Gossip stage has 21 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:31:19,214 Gossiper.java:1036 - Gossip stage has 21 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:31:20,315 Gossiper.java:1036 - Gossip stage has 21 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:31:21,416 Gossiper.java:1036 - Gossip stage has 22 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:31:22,518 Gossiper.java:1036 - Gossip stage has 21 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:31:23,619 Gossiper.java:1036 - Gossip stage has 21 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:39:30,445 Gossiper.java:1036 - Gossip stage has 279 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:39:31,546 Gossiper.java:1036 - Gossip stage has 280 pending tasks; skipping status check (no nodes will be marked down)
WARN [MutationStage-96] 2022-06-02 20:39:31,962 NoSpamLogger.java:95 - /10.106.26.152:7000->/10.110.46.37:7000-SMALL_MESSAGES-60dee8f0 dropping message of type READ_RSP whose timeout expired before reaching the network
WARN [GossipTasks:1] 2022-06-02 20:39:32,648 Gossiper.java:1036 - Gossip stage has 279 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:39:33,749 Gossiper.java:1036 - Gossip stage has 281 pending tasks; skipping status check (no nodes will be marked down)
WARN [GossipTasks:1] 2022-06-02 20:39:34,850 Gossiper.java:1036 - Gossip stage has 284 pending tasks; skipping status check (no nodes will be marked down)
Half an hour later
WARN [GossipTasks:1] 2022-06-02 21:08:52,338 Gossiper.java:1036 - Gossip stage has 3697 pending tasks; skipping status check (no nodes will be marked down)
At the same time in debug log (10.20.0.91 was bad seed)
DEBUG [Native-Transport-Requests-10] 2022-06-02 20:30:36,900 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-29] 2022-06-02 20:30:36,914 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-28] 2022-06-02 20:30:36,931 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-41] 2022-06-02 20:30:36,969 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-18] 2022-06-02 20:30:36,977 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-1] 2022-06-02 20:30:37,007 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-11] 2022-06-02 20:30:37,028 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-37] 2022-06-02 20:30:37,034 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-47] 2022-06-02 20:30:37,038 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-38] 2022-06-02 20:30:37,061 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-12] 2022-06-02 20:30:37,066 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-61] 2022-06-02 20:30:37,069 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-58] 2022-06-02 20:30:37,112 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-10] 2022-06-02 20:30:37,117 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-24] 2022-06-02 20:30:37,124 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-30] 2022-06-02 20:30:37,127 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-28] 2022-06-02 20:30:37,147 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-17] 2022-06-02 20:30:37,158 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-16] 2022-06-02 20:30:37,191 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-21] 2022-06-02 20:30:37,204 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-44] 2022-06-02 20:30:37,232 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-31] 2022-06-02 20:30:37,232 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-43] 2022-06-02 20:30:37,243 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-25] 2022-06-02 20:30:37,253 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-29] 2022-06-02 20:30:37,261 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-5] 2022-06-02 20:31:15,803 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-6] 2022-06-02 20:31:15,826 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-32] 2022-06-02 20:31:15,840 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-28] 2022-06-02 20:31:15,867 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-51] 2022-06-02 20:31:15,875 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
WARN [GossipTasks:1] 2022-06-02 20:31:15,909 Gossiper.java:1036 - Gossip stage has 19 pending tasks; skipping status check (no nodes will be marked down)
DEBUG [Native-Transport-Requests-16] 2022-06-02 20:31:15,931 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-46] 2022-06-02 20:31:15,960 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
INFO [Messaging-EventLoop-3-8] 2022-06-02 20:31:16,005 NoSpamLogger.java:92 - /10.106.26.152:7000->/10.20.0.91:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /10.20.0.91:7000
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
DEBUG [Native-Transport-Requests-13] 2022-06-02 20:31:16,092 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-24] 2022-06-02 20:31:16,104 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-9] 2022-06-02 20:31:16,109 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-31] 2022-06-02 20:31:16,113 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-48] 2022-06-02 20:31:16,134 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-3] 2022-06-02 20:31:16,140 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-25] 2022-06-02 20:31:16,155 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-45] 2022-06-02 20:31:16,155 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-70] 2022-06-02 20:31:16,162 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-10] 2022-06-02 20:31:16,773 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
DEBUG [Native-Transport-Requests-2] 2022-06-02 20:31:16,774 ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)