Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Fix Version: 3.0
Description
On any network error, the Raft log consumes about 1 GB per node every 5 minutes on a 3-node cluster!
- Start a 3-node cluster.
- Start creating tables in a loop (create 50 tables, insert 1 row into each).
- Kill 1 node.
Expected result:
The cluster prints a few errors, updates the topology and continues operations.
Actual result:
Logs on the two remaining nodes contain 20 files of ~100 MB each, filled with similar ERRORs:

    grep "[ReplicatorGroupImpl] Fail to check replicator connection to" ignite3db* | wc -l
    2 423 492

    grep "[AbstractClientService] Fail to connect TablesAmountCapacityMultiNodeTest_cluster_1, exception: org.apache.ignite.internal.raft.PeerUnavailableException: Peer TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc -l
    2 547 696

That is in just 9 minutes, on each node.
Implementation notes
See also PeerUnavailableException
Raft writes the lines above to the log every time it fails to send any message to the killed node. Instead, it could remember the dead peer and log only the first connection failure, like this:

    // A thread-safe set, e.g. ConcurrentHashMap.newKeySet(), rather than
    // a volatile ArrayList, since replicators run on multiple threads.
    private final Set<PeerId> deadPeers = ConcurrentHashMap.newKeySet();

    if (!client.connect(peer)) {
        // Set.add() returns true only on first insertion, so the ERROR
        // is logged at most once per dead peer.
        if (deadPeers.add(peer)) {
            LOG.error("Fail to check replicator connection to peer={}, replicatorType={}.", peer, replicatorType);
        }
        this.failureReplicators.put(peer, replicatorType);
        return false;
    }
There are several places in the code to fix.
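The deduplication idea can be sketched as a small self-contained helper. This is a minimal illustration, not Ignite code: the names `DeadPeerTracker`, `shouldLogFailure`, and `markAlive` are hypothetical, and peers are represented as plain strings instead of `PeerId`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers peers already reported dead, so the
// caller logs a connection failure only once per peer, not per attempt.
class DeadPeerTracker {
    private final Set<String> deadPeers = ConcurrentHashMap.newKeySet();

    // Returns true only on the FIRST failure observed for this peer,
    // i.e. exactly when the caller should emit the ERROR log line.
    boolean shouldLogFailure(String peerId) {
        return deadPeers.add(peerId); // atomic; false for duplicates
    }

    // Called when the peer is reachable again, so a later death
    // of the same peer is logged anew.
    void markAlive(String peerId) {
        deadPeers.remove(peerId);
    }
}

public class Demo {
    public static void main(String[] args) {
        DeadPeerTracker tracker = new DeadPeerTracker();
        System.out.println(tracker.shouldLogFailure("node-1")); // true: log it
        System.out.println(tracker.shouldLogFailure("node-1")); // false: suppress
        tracker.markAlive("node-1");
        System.out.println(tracker.shouldLogFailure("node-1")); // true again after recovery
    }
}
```

Resetting the entry in `markAlive` matters: without it, a peer that dies, recovers, and dies again would never be reported the second time.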
Definition of done
Raft logs at most one message per dead peer on each node.