Uploaded image for project: 'Ignite'
  1. Ignite
  2. IGNITE-19419

Internal queue overload under intensive SQL load

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.0.0-beta1
    • None
    • sql
    • Docs Required, Release Notes Required

    Description

      I tried to run a SQL benchmark (slightly modified Benchbase TPC-C scenario) against Ignite3 (commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).

      Ignite3 was running in a single-node mode with -Xmx8G -Xms8G memory limit (increased memory size comparing to IGNITE-19412). After some time it has become unavailable to the client. Log is filled with error messages like the following:

       

      2023-05-03 17:51:57:705 +0300 [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue overload.]].
      2023-05-03 17:51:57:705 +0300 [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request type=ActionRequestImpl occurred (will be retried on the randomly selected node): 
      java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
      	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
      	at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
      	at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
      	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
      	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
      	at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.util.concurrent.TimeoutException
      	... 7 more
      2023-05-03 17:51:57:705 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      2023-05-03 17:51:57:705 +0300 [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] FSMCaller already in error status, ignore new error
      Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue overload.]]
      	at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
      	at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
      	at org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
      	at org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
      	at org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
      	at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
      	at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
      	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      2023-05-03 17:51:57:706 +0300 [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] Fail to find the next candidate.
      2023-05-03 17:51:57:706 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      2023-05-03 17:51:57:706 +0300 [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) is in error.].
      2023-05-03 17:51:57:706 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      

      More errors could be observed in the attached log archive.

      Actually the log storage is not corrupted but the queue is overloaded. There are some performance improvements in main JRaft repository ( https://github.com/sofastack/sofa-jraft ) that may fix that, more exactly it is commit 5de2fbbcabb70ddbefb06a1d3737821781c3e85c in sofa-jraft repo.

      Attachments

        1. ai3.jstack
          383 kB
          Andrey Khitrin
        2. logs.zip
          1.28 MB
          Andrey Khitrin

        Issue Links

          Activity

            People

              Unassigned Unassigned
              akhitrin Andrey Khitrin
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated: