IGNITE-19419

Internal queue overload under intensive SQL load


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0-beta1
    • Fix Version/s: None
    • Component/s: sql
    • Flags: Docs Required, Release Notes Required

    Description

      I tried to run a SQL benchmark (slightly modified Benchbase TPC-C scenario) against Ignite3 (commit hash b48ddcba7cd2bd3b9a053ae131c25b44a0400e27).
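
      For context, here is a rough, hypothetical sketch of what such a workload boils down to on the client side: Benchbase drives the database through plain JDBC, with many terminals issuing small TPC-C transactions concurrently. The connection URL, port 10800 and the table/column names below are assumptions for illustration, not taken from the actual benchmark configuration.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.PreparedStatement;

      public class TpccWorkerSketch {
          public static void main(String[] args) throws Exception {
              // Assumed thin-driver URL and default port; each Benchbase terminal keeps its own connection.
              try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800")) {
                  conn.setAutoCommit(false);
                  try (PreparedStatement upd = conn.prepareStatement(
                          "UPDATE WAREHOUSE SET W_YTD = W_YTD + ? WHERE W_ID = ?")) {
                      upd.setDouble(1, 42.0);  // illustrative Payment amount
                      upd.setInt(2, 1);        // warehouse id
                      upd.executeUpdate();     // a single TPC-C transaction issues several such statements
                  }
                  conn.commit();
              }
          }
      }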

      Ignite3 was running in single-node mode with an -Xmx8G -Xms8G memory limit (an increased heap size compared to IGNITE-19412). After some time it became unavailable to the client. The log is filled with error messages like the following:

       

      2023-05-03 17:51:57:705 +0300 [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> got error: Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue overload.]].
      2023-05-03 17:51:57:705 +0300 [WARNING][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error during the request type=ActionRequestImpl occurred (will be retried on the randomly selected node): 
      java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
      	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
      	at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
      	at java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
      	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
      	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
      	at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.util.concurrent.TimeoutException
      	... 7 more
      2023-05-03 17:51:57:705 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_5/node0> append [2268, 2268] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      2023-05-03 17:51:57:705 +0300 [WARNING][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][FSMCallerImpl] FSMCaller already in error status, ignore new error
      Error [type=ERROR_TYPE_LOG, status=Status[EBUSY<1009>: LogManager is busy, disk queue overload.]]
      	at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.reportError(LogManagerImpl.java:587)
      	at org.apache.ignite.raft.jraft.storage.impl.LogManagerImpl.appendEntries(LogManagerImpl.java:337)
      	at org.apache.ignite.raft.jraft.core.NodeImpl.executeApplyingTasks(NodeImpl.java:1621)
      	at org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:305)
      	at org.apache.ignite.raft.jraft.core.NodeImpl$LogEntryAndClosureHandler.onEvent(NodeImpl.java:288)
      	at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:217)
      	at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:181)
      	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      2023-05-03 17:51:57:706 +0300 [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][ReplicatorGroupImpl] Fail to find the next candidate.
      2023-05-03 17:51:57:706 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <e92d1429-965a-4cbd-af37-126844f352ca_part_18/node0> append [2271, 2271] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      2023-05-03 17:51:57:706 +0300 [INFO][%node0%JRaft-FSMCaller-Disruptor-_stripe_10-0][StateMachineAdapter] onLeaderStop: status=Status[EBADNODE<10009>: Raft node(leader or candidate) is in error.].
      2023-05-03 17:51:57:706 +0300 [ERROR][%node0%JRaft-LogManager-Disruptor-_stripe_10-0][NodeImpl] Node <30181391-2f04-40c0-8c35-a77592b0731f_part_5/node0> append [2285, 2285] failed, status=Status[EIO<1014>: Corrupted LogStorage].
      

      More errors can be found in the attached log archive.

      The log storage is not actually corrupted; the disk queue is overloaded. There are performance improvements in the upstream JRaft repository (https://github.com/sofastack/sofa-jraft) that may fix this, specifically commit 5de2fbbcabb70ddbefb06a1d3737821781c3e85c in the sofa-jraft repo.
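
      To illustrate the failure mode, below is a simplified sketch of the pattern involved (it is not the actual LogManagerImpl/Disruptor code; the queue type, capacity and method names are illustrative only). Appended log entries are handed off to a bounded in-memory queue that a disk-writer thread drains; once producers outrun the disk, the hand-off is rejected and surfaced as the EBUSY "disk queue overload" error, which then shows up downstream as the misleading "Corrupted LogStorage" status even though nothing on disk is corrupted.

      import java.util.concurrent.ArrayBlockingQueue;
      import java.util.concurrent.BlockingQueue;

      public class DiskQueueOverloadSketch {
          // Bounded buffer between the Raft threads and the disk writer; the capacity is arbitrary here.
          private final BlockingQueue<byte[]> diskQueue = new ArrayBlockingQueue<>(16_384);

          /** Producer side: hand a serialized log entry off without blocking the Raft thread. */
          boolean append(byte[] entry) {
              // Under intensive SQL load entries arrive faster than the writer can flush them,
              // the queue fills up, offer() starts returning false and the caller reports EBUSY.
              return diskQueue.offer(entry);
          }

          /** Writer side: drain the queue and persist entries (flush/fsync details omitted). */
          void writerLoop() throws InterruptedException {
              while (true) {
                  byte[] entry = diskQueue.take();
                  // persist(entry);  // the slow disk I/O the queue is buffering
              }
          }
      }

      In this simplified model, either applying backpressure to producers or making the writer fast enough (which is presumably what the upstream improvements help with) keeps the queue from overflowing instead of failing appends outright.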

      Attachments

        1. ai3.jstack (383 kB, uploaded by Andrey Khitrin)
        2. logs.zip (1.28 MB, uploaded by Andrey Khitrin)


          People

            Assignee: Unassigned
            Reporter: Andrey Khitrin (akhitrin)

            Dates

              Created:
              Updated:
