Uploaded image for project: 'Ratis'
  1. Ratis
  2. RATIS-982

Fix RaftServerImpl illegal transition from RUNNING to RUNNING

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.0.0
    • None
    • None

    Description

      This happens in test, but it maybe also happen in production.

      For example, leader is s3 and follower is s4.
      1. kill s4, and restart s4.

      2020-06-19 07:03:18,095 [Thread-6194] INFO  ratis.MiniRaftCluster (MiniRaftCluster.java:killServer(458)) - killServer s4
      2020-06-19 07:03:18,095 [Thread-6194] INFO  ratis.MiniRaftCluster (MiniRaftCluster.java:newRaftServer(330)) - newRaftServer: s4, group-5BD7E8A01610:[s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], format? false
      

      2. s4 start and set configuration from storage at setRaftConf(raftConf.getLogEntryIndex(), raftConf) and s4 will change to RUNNING at lifeCycle.transition(RUNNING)

      2020-06-19 07:03:18,127 [pool-16-thread-1] INFO  impl.RaftServerImpl (ServerState.java:setRaftConf(356)) - s4@group-5BD7E8A01610: set configuration 0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null at 0
      2020-06-19 07:03:18,153 [Thread-6194] INFO  impl.RaftServerImpl (RaftServerImpl.java:start(185)) - s4@group-5BD7E8A01610: start as a follower, conf=0: [s3:0.0.0.0:43375, s4:0.0.0.0:33719, s0:0.0.0.0:34867, s1:0.0.0.0:33783, s2:0.0.0.0:40473], old=null
      2020-06-19 07:03:18,153 [Thread-6194] INFO  impl.RaftServerImpl (RaftServerImpl.java:setRole(174)) - s4@group-5BD7E8A01610: changes role from      null to FOLLOWER at term 1 for startAsFollower
      

      3. s3 send append entry request to s4, and s4 change to RUNNING at lifeCycle.compareAndTransition(STARTING, RUNNING)

      2020-06-19 07:03:18,162 [nioEventLoopGroup-59-1] DEBUG impl.RaftServerImpl (RaftServerImpl.java:logAppendEntries(918)) - s4@group-5BD7E8A01610: receive appendEntries(s3, 1, (t:1, i:0), 0, false, commits[s3:c0, s4:c0, s0:c0, s1:c0, s2:c0], entries: (t:1, i:1), STATEMACHINELOGENTRY, client-9414EC4E73DA, cid=3000
      

      4. If change to RUNNING in step3 happens before step2, then step2 will throw exception.

      
      2020-06-19 07:03:18,169 [Thread-6194] INFO  impl.RoleInfo (RoleInfo.java:updateAndGet(143)) - s4: start FollowerState
      2020-06-19 07:03:18,174 [Thread-6194] ERROR netty.TestRaftWithNetty (ExitUtils.java:terminate(133)) - Terminating with exit status -1: Failed to kill/restart server: s4
      2020-06-19T07:03:18.1918474Z java.lang.IllegalStateException: ILLEGAL TRANSITION: In s4, RUNNING -> RUNNING
      2020-06-19T07:03:18.1918899Z 	at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:63)
      2020-06-19T07:03:18.1919240Z 	at org.apache.ratis.util.LifeCycle$State.validate(LifeCycle.java:115)
      2020-06-19T07:03:18.1919558Z 	at org.apache.ratis.util.LifeCycle.transition(LifeCycle.java:155)
      2020-06-19T07:03:18.1919878Z 	at org.apache.ratis.server.impl.RaftServerImpl.startAsFollower(RaftServerImpl.java:214)
      2020-06-19T07:03:18.1920206Z 	at org.apache.ratis.server.impl.RaftServerImpl.start(RaftServerImpl.java:186)
      2020-06-19T07:03:18.1920520Z 	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
      2020-06-19T07:03:18.1920839Z 	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
      2020-06-19T07:03:18.1921330Z 	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
      2020-06-19T07:03:18.1921639Z 	at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
      2020-06-19T07:03:18.1921951Z 	at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
      2020-06-19T07:03:18.1922261Z 	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
      2020-06-19T07:03:18.1922575Z 	at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
      2020-06-19T07:03:18.1922885Z 	at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
      2020-06-19T07:03:18.1925464Z 	at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
      2020-06-19T07:03:18.1940816Z 	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
      2020-06-19T07:03:18.1953283Z 	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
      2020-06-19T07:03:18.1967610Z 	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
      2020-06-19T07:03:18.1980549Z 	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
      2020-06-19T07:03:18.1991620Z 	at org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:301)
      2020-06-19T07:03:18.1991958Z 	at org.apache.ratis.MiniRaftCluster.restartServer(MiniRaftCluster.java:312)
      2020-06-19T07:03:18.1992275Z 	at org.apache.ratis.MiniRaftCluster.restartServer(MiniRaftCluster.java:304)
      2020-06-19T07:03:18.1992609Z 	at org.apache.ratis.RaftBasicTests.lambda$killAndRestartServer$2(RaftBasicTests.java:100)
      2020-06-19T07:03:18.1992920Z 	at java.lang.Thread.run(Thread.java:748)
      
      

      Attachments

        Issue Links

          Activity

            People

              yjxxtd runzhiwang
              yjxxtd runzhiwang
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 50m
                  50m