Ignite / IGNITE-19239

Checkpoint read lock acquisition timeouts during snapshot restore


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Critical
    • Resolution: Unresolved

    Description

      Error messages about checkpoint read lock acquisition timeouts and blocked system-critical threads may appear during the snapshot restore process (just after the caches start):

      [2023-04-06T10:55:46,561][ERROR][ttl-cleanup-worker-#475%node%][CheckpointTimeoutLock] Checkpoint read lock acquisition has been timed out.

      [2023-04-06T10:55:47,487][ERROR][tcp-disco-msg-worker-[crd]-#23%node%-#446%node%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=db-checkpoint-thread, threadName=db-checkpoint-thread-#457%snapshot.BlockingThreadsOnSnapshotRestoreReproducerTest0%, blockedFor=100s]

      There is also an active exchange process, which finishes with the following timings (the timings are approximately equal to the blocking time of the threads):

      [2023-04-06T10:55:52,211][INFO ][exchange-worker-#450%node%][GridDhtPartitionsExchangeFuture] Exchange timings [startVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], resVer=AffinityTopologyVersion [topVer=1, minorTopVer=5], stage="Waiting in exchange queue" (0 ms), ..., stage="Restore partition states" (100163 ms), ..., stage="Total time" (100334 ms)]

       
      Most of the time, such errors and long-lasting thread blocking indicate that the cluster is in an emergency state or will crash very soon.

      So, there are two possible ways to solve the problem:

      1. If these errors do not affect restoring from a snapshot and are false positives, they are merely confusing and should be removed from the logs.
      2. If these errors are not false positives, their root cause has to be investigated and fixed.

       

      How to reproduce:

      1. Set the checkpoint frequency to less than the failure detection timeout.
      2. Ensure that restoring the cache groups' partition states lasts longer than the failure detection timeout, i.e. this applies to sufficiently large caches.

      Reproducer: BlockingThreadsOnSnapshotRestoreReproducerTest.patch
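      The two reproduction conditions above can be sketched as a node configuration. This is a minimal illustration, not the attached reproducer patch: the class name and the concrete timeout values are hypothetical; only the relationship between them (checkpoint frequency < failure detection timeout) matters.

      ```java
      import org.apache.ignite.configuration.DataStorageConfiguration;
      import org.apache.ignite.configuration.IgniteConfiguration;

      // Hypothetical configuration sketch for the reproduction scenario:
      // checkpoints are scheduled more frequently than the failure detection
      // timeout, so if "Restore partition states" holds the checkpoint read
      // lock longer than that timeout (step 2, large caches), the checkpoint
      // thread is reported as a blocked system-critical thread.
      public class SnapshotRestoreReproducerConfig {
          public static IgniteConfiguration config(String nodeName) {
              DataStorageConfiguration storageCfg = new DataStorageConfiguration()
                  // Checkpoint every 3 s (illustrative value)...
                  .setCheckpointFrequency(3_000L);

              storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

              return new IgniteConfiguration()
                  .setIgniteInstanceName(nodeName)
                  // ...while a worker only counts as blocked after 10 s.
                  .setFailureDetectionTimeout(10_000L)
                  .setDataStorageConfiguration(storageCfg);
          }
      }
      ```

      With such a configuration, restoring a snapshot of sufficiently large caches should make the "Restore partition states" exchange stage outlast the failure detection timeout and trigger the errors quoted above.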

      Attachments


          People

            timonin.maksim Maksim Timonin
            shishkovilja Ilya Shishkov


              Time Tracking

              Logged: 10m
