Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-14684

Stopping node at the end of checkpoint can cause "Critical system error"

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.11
    • persistence
    • None
    • Fixed node fail due to deleting DurableBackgroundTask's at the end of a checkpoint when stopping a node.
    • Release Notes Required

    Description

      Checkpoint listener org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor#afterCheckpointEnd which trigger at the end of checkpoint process can not take checkpoint read lock during node stopping.

      Run test(see exception in log):org.apache.ignite.internal.processors.cache.persistence.db.LongDestroyDurableBackgroundTaskTest#testDestroyTaskLifecycle

      [2021-05-05 15:41:10,907][ERROR][db-checkpoint-thread-#87%db.LongDestroyDurableBackgroundTaskTest0%][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Failed to perform cache update: node is stopping.]]
      class org.apache.ignite.IgniteException: Failed to perform cache update: node is stopping.
      	at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:127)
      	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1583)
      	at org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.metaStorageOperation(DurableBackgroundTasksProcessor.java:335)
      	at org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.afterCheckpointEnd(DurableBackgroundTasksProcessor.java:152)
      	at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointWorkflow.markCheckpointEnd(CheckpointWorkflow.java:606)
      	at org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:479)
      	at org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.body(Checkpointer.java:282)
      	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: class org.apache.ignite.internal.NodeStoppingException: Failed to perform cache update: node is stopping.
      	... 9 more
      

      Attachments

        Issue Links

          Activity

            People

              ktkalenko@gridgain.com Kirill Tkalenko
              makedonskaya Maria Makedonskaya
              Ivan Bessonov Ivan Bessonov
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m