Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Fix Version: 3.0
Description
On any network error, the Raft log consumes about 1 GB per node every 5 minutes on a 3-node cluster!
- Start a 3-node cluster.
- Start creating tables in a loop (create 50 tables, insert 1 row into each).
- Kill 1 node.
Expected result:
The cluster prints a few errors, updates the topology and continues operations.
Actual result:
Logs on the two remaining nodes contain 20 files of ~100 MB each, filled with similar ERRORs:

    grep "[ReplicatorGroupImpl] Fail to check replicator connection to" ignite3db* | wc -l
    2 423 492

    grep "[AbstractClientService] Fail to connect TablesAmountCapacityMultiNodeTest_cluster_1, exception: org.apache.ignite.internal.raft.PeerUnavailableException: Peer TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc -l
    2 547 696

That is in just 9 minutes, on each node.
Implementation notes
See also PeerUnavailableException
Raft writes the lines above to the log every time it fails to send any message to the killed node. Instead, it could remember the dead peer and log only the first connection failure, like this:

    // A thread-safe set, e.g. ConcurrentHashMap.newKeySet(), rather than
    // a volatile ArrayList, since replicators run on multiple threads.
    private final Set<PeerId> deadPeers = ConcurrentHashMap.newKeySet();

    if (!client.connect(peer)) {
        // Set.add() returns true only on first insertion, so the ERROR
        // is logged at most once per dead peer.
        if (deadPeers.add(peer)) {
            LOG.error("Fail to check replicator connection to peer={}, replicatorType={}.", peer, replicatorType);
        }
        this.failureReplicators.put(peer, replicatorType);
        return false;
    }
There are several places in the code to fix.
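The deduplication idea can be sketched as a small self-contained helper. This is a minimal illustration, not Ignite code: the names `DeadPeerTracker`, `shouldLogFailure`, and `markAlive` are hypothetical, and peers are represented as plain strings instead of `PeerId`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers peers already reported dead, so the
// caller logs a connection failure only once per peer, not per attempt.
class DeadPeerTracker {
    private final Set<String> deadPeers = ConcurrentHashMap.newKeySet();

    // Returns true only on the FIRST failure observed for this peer,
    // i.e. exactly when the caller should emit the ERROR log line.
    boolean shouldLogFailure(String peerId) {
        return deadPeers.add(peerId); // atomic; false for duplicates
    }

    // Called when the peer is reachable again, so a later death
    // of the same peer is logged anew.
    void markAlive(String peerId) {
        deadPeers.remove(peerId);
    }
}

public class Demo {
    public static void main(String[] args) {
        DeadPeerTracker tracker = new DeadPeerTracker();
        System.out.println(tracker.shouldLogFailure("node-1")); // true: log it
        System.out.println(tracker.shouldLogFailure("node-1")); // false: suppress
        tracker.markAlive("node-1");
        System.out.println(tracker.shouldLogFailure("node-1")); // true again after recovery
    }
}
```

Resetting the entry in `markAlive` matters: without it, a peer that dies, recovers, and dies again would never be reported the second time.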
Definition of done
Raft logs at most one message per dead peer on each node.