Ignite / IGNITE-21307

Call failure handler in case of failure in WatchProcessor


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0.0-beta2
    • Component/s: None

    Description

      To linearize watch processing, WatchProcessor#notificationFuture is rewritten on every revision processing and on every meta storage safe time advance. If one of these notifications completes exceptionally, no further updates will be processed, because each update requires the previous ones to have been processed successfully. This is implemented by chaining futures like this:

       

      // Each new notification is chained onto the previous one; if the previous future
      // has already failed, thenRunAsync is skipped and whenComplete only logs the same
      // propagated exception again.
      notificationFuture = notificationFuture
              .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time), watchExecutor)
              .whenComplete((ignored, e) -> {
                  if (e != null) {
                      LOG.error("Error occurred when notifying safe time advanced callback", e);
                  }
              });
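
      As a standalone illustration (plain CompletableFuture, no Ignite classes; all names below are ours), the following sketch shows why a single exceptional completion poisons the whole chain: every later thenRun stage is skipped, and every whenComplete callback keeps observing the same original cause.

      import java.util.concurrent.CompletableFuture;

      public class ChainPoisoningDemo {
          public static void main(String[] args) {
              // Plays the role of WatchProcessor#notificationFuture.
              CompletableFuture<Void> notificationFuture = CompletableFuture.completedFuture(null);

              // The first "notification" fails, e.g. a watch listener throws.
              notificationFuture = notificationFuture
                      .thenRun(() -> { throw new RuntimeException("watch listener failed"); })
                      .whenComplete((ignored, e) -> {
                          if (e != null) {
                              System.err.println("notification 1 failed: " + e);
                          }
                      });

              // Every later chained stage is skipped; whenComplete keeps seeing
              // the same propagated cause, which is what floods the log.
              for (int i = 2; i <= 4; i++) {
                  int n = i;
                  notificationFuture = notificationFuture
                          .thenRun(() -> System.out.println("notification " + n + " processed"))
                          .whenComplete((ignored, e) -> {
                              if (e != null) {
                                  System.err.println("notification " + n + " failed: " + e);
                              }
                          });
              }
          }
      }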

      For now, we don't have any failure handling for an exceptionally completed notification future. This leads to endless log records with the same exception stack trace, produced on every meta storage safe time advance:

       

      [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor] Error occurred when notifying safe time advanced callback
      java.util.concurrent.CompletionException: org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535 TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready [groupId=5_part_0]
          at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
          at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
          at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081) ~[?:?]
          at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
          at java.lang.Thread.run(Thread.java:834) [?:?]
      Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are not ready [groupId=5_part_0]
          at org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187) ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73) ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917) ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827) ~[ignite-core-9.0.127-SNAPSHOT.jar:?]
          at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913) ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
          at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
          ... 4 more 

      So, the node can't operate properly and just produces tons of logs. Such nodes should be halted.

      UPD:

      We decided to simply add an invocation of failureProcessor.process in all places in org.apache.ignite.internal.metastorage.server.WatchProcessor where exceptions can occur, for example:

                      notifyWatchFuture = watchAndEvents.watch.onUpdate(event)
                              .whenComplete((v, e) -> {
                                  if (e != null) {
                                      if (e instanceof CompletionException) {
                                          e = e.getCause();
                                      }
      
                                      if (!(e instanceof NodeStoppingException)) {
                                          // TODO: IGNITE-14693 Implement Meta storage exception handling
                                          LOG.error("Error occurred when processing a watch event", e);
                                      }
      
                                      watchAndEvents.watch.onError(e);
                                  }
                              });
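
      A rough sketch of what that could look like is below. It assumes the internal failure handling API (a FailureProcessor with a process(FailureContext) method, a FailureContext(FailureType, Throwable) constructor and FailureType.CRITICAL_ERROR) and a failureProcessor field available in WatchProcessor; the exact types and the failure type chosen in the final patch may differ.

      notifyWatchFuture = watchAndEvents.watch.onUpdate(event)
              .whenComplete((v, e) -> {
                  if (e != null) {
                      if (e instanceof CompletionException) {
                          e = e.getCause();
                      }

                      // Node stop is an expected shutdown path, not a node failure.
                      if (!(e instanceof NodeStoppingException)) {
                          LOG.error("Error occurred when processing a watch event", e);

                          // Hand the error over to the failure handler so the node can be
                          // halted instead of endlessly logging the same exception.
                          // FailureContext(FailureType, Throwable) is assumed here.
                          failureProcessor.process(new FailureContext(FailureType.CRITICAL_ERROR, e));
                      }

                      watchAndEvents.watch.onError(e);
                  }
              });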
      

       

            People

              Vyacheslav Koptilin (slava.koptilin)
              Denis Chudov
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m