Details
-
Bug
-
Status: Open
-
Critical
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
This is the second raft gap problem reported by Guo Hao.
2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception java.util.concurrent.CompletionException: java.lang.IllegalStateException: gap between start index 375 and first entry to append 377 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284) at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191) at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalStateException: gap between start index 375 and first entry to append 377 at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60) at org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313) at org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165) at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138) at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241) at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214) at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251) at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239) at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220) at org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161) at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62) at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177) at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338) at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188) ... 4 more
- The gap server directory
$ ll /home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/ total 4120 -rw-rw-r-- 1 work work 14567 Sep 8 18:30 log_291-374 -rw-rw-r-- 1 work work 4194304 Sep 8 18:30 log_inprogress_375 -rw-rw-r-- 1 work work 50 Sep 8 18:30 raft-meta -rw-rw-r-- 1 work work 242 Sep 8 17:29 raft-meta.conf
The other two
$ ll total 4168 -rw-rw-r-- 1 work work 95 Sep 8 12:13 log_0-0 -rw-rw-r-- 1 work work 39285 Sep 8 17:30 log_1-290 -rw-rw-r-- 1 work work 14567 Sep 8 17:35 log_291-374 -rw-rw-r-- 1 work work 271 Sep 8 17:50 log_375-376 -rw-rw-r-- 1 work work 4194304 Sep 8 19:01 log_inprogress_377 -rw-rw-r-- 1 work work 86 Sep 8 18:29 raft-meta -rw-rw-r-- 1 work work 242 Sep 8 18:29 raft-meta.conf
$ ll total 4168 -rw-rw-r-- 1 work work 95 Sep 8 13:15 log_0-0 -rw-rw-r-- 1 work work 39285 Sep 8 17:29 log_1-290 -rw-rw-r-- 1 work work 14567 Sep 8 17:35 log_291-374 -rw-rw-r-- 1 work work 271 Sep 8 18:29 log_375-376 -rw-rw-r-- 1 work work 4194304 Sep 8 19:01 log_inprogress_377 -rw-rw-r-- 1 work work 86 Sep 8 18:29 raft-meta -rw-rw-r-- 1 work work 242 Sep 8 18:29 raft-meta.conf
Related Configurations:
<property> <name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name> <value>false</value> </property> <property> <name>hdds.ratis.raft.server.log.async-flush.enabled</name> <value>false</value> </property>
The scene in which the GAP occurs this time is as follows:
1. shutdown scm, shutdown more than 60s timeout kill -9
2. restart scm, this error occurs
Attachments
Attachments
Issue Links
- relates to
-
RATIS-1887 Gap between segement log
- Resolved