Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
There is a cluster with 3 nodes and 1200 partitions in total (400 per node). When the cluster is restarted, each node successfully recovers the Metastorage, the Metastorage leader is elected, and then partition recovery starts. This produces many exceptions like the following in the logs:
2024-11-08 13:23:28:845 +0000 [INFO][%node1%tableManager-io-15][NodeImpl] Node <48_part_3/node1> start vote and grant vote self, term=1.
2024-11-08 13:23:28:846 +0000 [ERROR][%node1%Raft-Group-Client-14][RebalanceUtil] Exception on updating assignments for [tableId=38, name=INVENTORY, partition=23]
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException: Send with retry timed out [retryCount = 7, groupId = metastorage_group, traceId = 5f329100-3de7-4ab8-a796-9969b7b91b22].
at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:559)
at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$scheduleRetry$40(RaftGroupServiceImpl.java:750)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Also, there is another stack trace:
2024-11-08 13:27:03:523 +0000 [WARNING][%node1%rebalance-scheduler-11][RebalanceRaftGroupEventsListener] Unable to start rebalance [tablePartitionId, term=44_part_45]
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Send with retry timed out [retryCount = 7, groupId = metastorage_group, traceId = d52b447e-3c40-4f4b-9c67-863be811b0cb].
at java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
at org.apache.ignite.internal.distributionzones.rebalance.RebalanceRaftGroupEventsListener.lambda$onLeaderElected$0(RebalanceRaftGroupEventsListener.java:167)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.TimeoutException: Send with retry timed out [retryCount = 7, groupId = metastorage_group, traceId = d52b447e-3c40-4f4b-9c67-863be811b0cb].
at org.apache.ignite.internal.raft.RaftGroupServiceImpl.sendWithRetry(RaftGroupServiceImpl.java:559)
at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$scheduleRetry$40(RaftGroupServiceImpl.java:750)
... 6 more
It seems that an avalanche of Metastorage accesses by hundreds of starting partitions overloads the Metastorage leader, so recovery fails with TimeoutExceptions.
We could probably solve this by introducing some form of rate limiting on Metastorage accesses, either only for the recovery procedure or for normal operation as well.
High-priority accesses (Metastorage SafeTime propagation, Lease updates) should not be subject to rate limiting.
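To illustrate one possible shape of such rate limiting, here is a minimal sketch. This is not existing Ignite code: the MetastorageThrottle class, its constructor parameter, and its method names are hypothetical. It caps the number of low-priority Metastorage requests in flight using a Semaphore, while high-priority calls (SafeTime propagation, lease updates) bypass the limiter entirely.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/**
 * Hypothetical throttle for Metastorage accesses (illustration only, not an existing API).
 * Low-priority calls must acquire a permit before being sent, bounding the number of
 * requests that can hit the Metastorage leader concurrently; high-priority calls skip it.
 */
public class MetastorageThrottle {
    /** Upper bound on concurrent low-priority Metastorage requests (value is illustrative). */
    private final Semaphore permits;

    public MetastorageThrottle(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight, true);
    }

    /** Runs a low-priority Metastorage call, waiting for a permit before sending it. */
    public <T> CompletableFuture<T> runThrottled(Supplier<CompletableFuture<T>> call) {
        try {
            permits.acquire();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return CompletableFuture.failedFuture(e);
        }
        // Release the permit once the request completes, successfully or not.
        return call.get().whenComplete((res, err) -> permits.release());
    }

    /** Runs a high-priority call (SafeTime propagation, lease updates) without throttling. */
    public <T> CompletableFuture<T> runHighPriority(Supplier<CompletableFuture<T>> call) {
        return call.get();
    }
}

Blocking on Semaphore.acquire() is a simplification for the sketch; an asynchronous permit queue would fit the non-blocking style of the Metastorage client better, but the key point, exempting high-priority accesses from the limit, stays the same.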
Attachments
Issue Links
- relates to: IGNITE-23597 Idle Raft Replicator eats too much CPU (Open)