IGNITE-23395: Raft subsystem spams the log with network exceptions

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • 3.0
    • 3.0
    • networking, persistence

    Description

      On any network error, Raft error logging writes about 1 GB of log files per node every 5 minutes on a 3-node cluster.

      1. Start a 3-node cluster
      2. Create tables in a loop (create 50 tables and insert 1 row into each)
      3. Kill 1 node
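
Step 2 could be scripted roughly as follows. This is only a sketch that generates the SQL statements; the class name, table names, and column layout are hypothetical and do not come from the original test, and the statements would still have to be sent to the cluster through whatever client the test uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Generates SQL for the reproduction: N tables with one row in each (all names are hypothetical). */
final class ReproStatements {
    static List<String> forTables(int tableCount) {
        List<String> sql = new ArrayList<>(tableCount * 2);
        for (int i = 0; i < tableCount; i++) {
            String table = "repro_table_" + i; // hypothetical naming scheme
            // One CREATE followed by a single-row INSERT per table.
            sql.add("CREATE TABLE " + table + " (id INT PRIMARY KEY, val VARCHAR)");
            sql.add("INSERT INTO " + table + " (id, val) VALUES (1, 'v')");
        }
        return sql;
    }
}
```

Calling `ReproStatements.forTables(50)` yields 100 statements: a CREATE and an INSERT for each of the 50 tables.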

      Expected result:

      The cluster prints a few errors, updates the topology and continues operations.

      Actual result:

      The logs on the two remaining nodes contain 20 files of 100 MB each, filled with similar ERROR entries:

      • grep -F "[ReplicatorGroupImpl] Fail to check replicator connection to" ignite3db* | wc -l
        2 423 492
      • grep -F "[AbstractClientService] Fail to connect TablesAmountCapacityMultiNodeTest_cluster_1, exception: org.apache.ignite.internal.raft.PeerUnavailableException: Peer TablesAmountCapacityMultiNodeTest_cluster_1 is unavailable." ignite3db* | wc -l
        2 547 696

      That is within just 9 minutes, on each node.

      Implementation notes

      See also PeerUnavailableException
      Raft writes these lines to the log every time it fails to send a message to the killed node. Instead, it could remember the killed peer and consult that record on connection failure, logging only the first failure, like this:

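              // Volatile set of peers already reported as dead; a concurrent set in case checks run on several threads.
              Set<PeerId> deadPeers = ConcurrentHashMap.newKeySet();

              if (!client.connect(peer)) {
                  // add() returns true only the first time, so the error is logged once per dead peer.
                  if (deadPeers.add(peer)) {
                      LOG.error("Fail to check replicator connection to peer={}, replicatorType={}.", peer, replicatorType);
                  }
                  this.failureReplicators.put(peer, replicatorType);
                  return false;
              }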

      The same change is needed in several places in the code.

      Definition of done
      Raft writes only one message about each dead peer on a node.
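
Because several call sites need the same check, the "log once per dead peer" rule could be factored into a small helper. The sketch below is a hypothetical class, not part of the Ignite or Raft code base. It also assumes that a peer should be reported again if it reconnects and then fails a second time, which the ticket does not specify.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Decides whether a connection failure should be logged (hypothetical helper, not an Ignite API). */
final class DeadPeerTracker<P> {
    // In-memory only; a concurrent set because failures may be reported from several threads.
    private final Set<P> deadPeers = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first failure since the peer was last marked alive. */
    boolean shouldLogFailure(P peer) {
        return deadPeers.add(peer);
    }

    /** Assumption: called after a successful connect so that a later failure is logged again. */
    void markAlive(P peer) {
        deadPeers.remove(peer);
    }
}
```

A call site would then wrap its existing `LOG.error(...)` call in `if (tracker.shouldLogFailure(peer))`, where `tracker` is a shared instance, and call `tracker.markAlive(peer)` on the successful-connect path.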

            People

              Denis Chudov
              Berkov Alexander Belyak
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m