IGNITE-19419

Internal queue overload under intensive SQL load


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-beta1
    • Fix Version/s: None
    • Component/s: sql
    • Flags: Docs Required, Release Notes Required

    Description

      I tried to run a SQL benchmark (slightly modified Benchbase TPC-C scenario) against Ignite3 (commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).
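
      For context, here is a rough, hypothetical sketch of what such a workload boils down to on the client side: Benchbase drives the database through plain JDBC, with many terminals issuing small TPC-C transactions concurrently. The connection URL, port 10800 and the table/column names below are assumptions for illustration, not taken from the actual benchmark configuration.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;

      public class TpccWorkerSketch {
          public static void main(String[] args) throws Exception {
              // Assumed thin-driver URL and default port; each Benchbase terminal keeps its own connection.
              try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800")) {
                  conn.setAutoCommit(false);
                  try (PreparedStatement upd = conn.prepareStatement(
                          "UPDATE WAREHOUSE SET W_YTD = W_YTD + ? WHERE W_ID = ?")) {
                      upd.setDouble(1, 42.0);  // illustrative Payment amount
                      upd.setInt(2, 1);        // warehouse id
                      upd.executeUpdate();     // a single TPC-C transaction issues several such statements
                  }
                  conn.commit();
              }
          }
      }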

      Ignite3 was running in single-node mode with an -Xmx8G -Xms8G memory limit (an increased heap size compared to IGNITE-19412). After some time it became unavailable to the client. The log is filled with error messages like the following:

       

      2023-05-03 17:51:57:705 +0300 [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue overload.]].
      2023-05-03 17:51:57:705 +0300 [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request type=ActionRequestImpl occurred (will be retried on the randomly selected node): 
      java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
      	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
      	at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
      	at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
      	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
      	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
      	at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.util.concurrent.TimeoutException
      	... 7 more
      2023-05-03 17:51:57:705 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      2023-05-03 17:51:57:705 +0300 [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] FSMCaller already in error status, ignore new error
      Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue overload.]]
      	at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
      	at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
      	at org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
      	at org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
      	at org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
      	at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
      	at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
      	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      2023-05-03 17:51:57:706 +0300 [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] Fail to find the next candidate.
      2023-05-03 17:51:57:706 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      2023-05-03 17:51:57:706 +0300 [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) is in error.].
      2023-05-03 17:51:57:706 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      

      More errors can be found in the attached log archive.

      The log storage is not actually corrupted; the disk queue is overloaded. There are performance improvements in the upstream JRaft repository (https://github.com/sofastack/sofa-jraft) that may fix this, specifically commit 5de2fbbcabb70ddbefb06a1d3737821781c3e85c in the sofa-jraft repo.
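
      To illustrate the failure mode, below is a simplified sketch of the pattern involved (it is not the actual LogManagerImpl/Disruptor code; the queue type, capacity and method names are illustrative only). Appended log entries are handed off to a bounded in-memory queue that a disk-writer thread drains; once producers outrun the disk, the hand-off is rejected and surfaced as the EBUSY "disk queue overload" error, which then shows up downstream as the misleading "Corrupted LogStorage" status even though nothing on disk is corrupted.

      import java.util.concurrent.ArrayBlockingQueue;
      import java.util.concurrent.BlockingQueue;

      public class DiskQueueOverloadSketch {
          // Bounded buffer between the Raft threads and the disk writer; the capacity is arbitrary here.
          private final BlockingQueue<byte[]> diskQueue = new ArrayBlockingQueue<>(16_384);

          /** Producer side: hand a serialized log entry off without blocking the Raft thread. */
          boolean append(byte[] entry) {
              // Under intensive SQL load entries arrive faster than the writer can flush them,
              // the queue fills up, offer() starts returning false and the caller reports EBUSY.
              return diskQueue.offer(entry);
          }

          /** Writer side: drain the queue and persist entries (flush/fsync details omitted). */
          void writerLoop() throws InterruptedException {
              while (true) {
                  byte[] entry = diskQueue.take();
                  // persist(entry);  // the slow disk I/O the queue is buffering
              }
          }
      }

      In this simplified model, either applying backpressure to producers or making the writer fast enough (which is presumably what the upstream improvements help with) keeps the queue from overflowing instead of failing appends outright.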

      Attachments

        1. ai3.jstack (383 kB, uploaded by Andrey Khitrin)
        2. logs.zip (1.28 MB, uploaded by Andrey Khitrin)


          People

            Assignee: Unassigned
            Reporter: Andrey Khitrin (akhitrin)

            Dates

              Created:
              Updated:
