Uploaded image for project: 'Ratis'
  1. Ratis
  2. RATIS-1891

Gap between logs cause service startup failure

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      This is the second raft gap problem reported by Guo Hao.

      2023-09-08 18:53:47,590 [Listener at test17/9860] ERROR org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter: SCM start failed with exception
      java.util.concurrent.CompletionException: java.lang.IllegalStateException: gap between start index 375 and first entry to append 377
              at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
              at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
              at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
              at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
              at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
              at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
              at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:191)
              at org.apache.ratis.util.ConcurrentUtils.lambda$null$4(ConcurrentUtils.java:180)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.lang.IllegalStateException: gap between start index 375 and first entry to append 377
              at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:60)
              at org.apache.ratis.server.raftlog.segmented.LogSegment.append(LogSegment.java:313)
              at org.apache.ratis.server.raftlog.segmented.LogSegment.lambda$loadSegment$2(LogSegment.java:165)
              at org.apache.ratis.server.raftlog.segmented.LogSegment.readSegmentFile(LogSegment.java:138)
              at org.apache.ratis.server.raftlog.segmented.LogSegment.loadSegment(LogSegment.java:164)
              at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogCache.loadSegment(SegmentedRaftLogCache.java:381)
              at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.loadLogSegments(SegmentedRaftLog.java:241)
              at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.openImpl(SegmentedRaftLog.java:214)
              at org.apache.ratis.server.raftlog.RaftLogBase.open(RaftLogBase.java:251)
              at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:239)
              at org.apache.ratis.server.impl.ServerState.initRaftLog(ServerState.java:220)
              at org.apache.ratis.server.impl.ServerState.lambda$new$5(ServerState.java:161)
              at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62)
              at org.apache.ratis.server.impl.ServerState.initialize(ServerState.java:177)
              at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:338)
              at org.apache.ratis.util.ConcurrentUtils.accept(ConcurrentUtils.java:188)
              ... 4 more 
      
      1. The gap server directory
      
      $ ll /home/work/ozone/scm.ha.ratis-storage-test4/1d823d9f-3e87-4790-85fc-f1a93f7845e5/current/
      total 4120
      -rw-rw-r-- 1 work work   14567 Sep  8 18:30 log_291-374
      -rw-rw-r-- 1 work work 4194304 Sep  8 18:30 log_inprogress_375
      -rw-rw-r-- 1 work work      50 Sep  8 18:30 raft-meta
      -rw-rw-r-- 1 work work     242 Sep  8 17:29 raft-meta.conf 
      
      

      The other two

      $ ll
      total 4168
      -rw-rw-r-- 1 work work      95 Sep  8 12:13 log_0-0
      -rw-rw-r-- 1 work work   39285 Sep  8 17:30 log_1-290
      -rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
      -rw-rw-r-- 1 work work     271 Sep  8 17:50 log_375-376
      -rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
      -rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
      -rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 
      
      
      $ ll
      total 4168
      -rw-rw-r-- 1 work work      95 Sep  8 13:15 log_0-0
      -rw-rw-r-- 1 work work   39285 Sep  8 17:29 log_1-290
      -rw-rw-r-- 1 work work   14567 Sep  8 17:35 log_291-374
      -rw-rw-r-- 1 work work     271 Sep  8 18:29 log_375-376
      -rw-rw-r-- 1 work work 4194304 Sep  8 19:01 log_inprogress_377
      -rw-rw-r-- 1 work work      86 Sep  8 18:29 raft-meta
      -rw-rw-r-- 1 work work     242 Sep  8 18:29 raft-meta.conf 
      
      

      Related Configurations:

      <property>
      <name>hdds.ratis.raft.server.log.unsafe-flush.enabled</name>
      <value>false</value>
      </property>
      <property>
      <name>hdds.ratis.raft.server.log.async-flush.enabled</name>
      <value>false</value>
      </property> 
      

      The scene in which the GAP occurs this time is as follows:
      1. shutdown scm, shutdown more than 60s timeout kill -9
      2. restart scm, this error occurs

      Attachments

        1. log_291-374
          14 kB
          LiMinyu
        2. log_inprogress_377
          4.00 MB
          LiMinyu
        3. log_375-376
          0.3 kB
          LiMinyu
        4. log_inprogress_375
          4.00 MB
          LiMinyu

        Issue Links

          Activity

            People

              Sammi Sammi Chen
              Sammi Sammi Chen
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: