Uploaded image for project: 'Ratis'
  1. Ratis
  2. RATIS-1490

Follower should not change to candidate if already stopped

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 2.3.0
    • server
    • None

    Description

      RaftServerImpl creates a new FollowerState for each vote request received. The old FollowerState may trigger change to candidate role after it was shutdown by the server. Both RaftServerImpl#requestVote and the change to candidate by FollowerState requires lock on server (RaftServerImpl object). The order of events which triggers the problem:

      1. RaftServerImpl executing requestVote with lock on itself
      2. FollowerState checks conditions, then waits for lock on server
      3. RaftServerImpl shuts down old FollowerState, starts new one
      4. RaftServerImpl sets last RPC time on new FollowerState
      5. RaftServerImpl releases lock
      6. old FollowerState enters block with lock on server, triggers change to candidate

      Log from Ozone DN where this was seen:

      c7514957 votes for 18b95dfb:

      2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO  server.RaftServer$Division (RaftServerImpl.java:requestVote(1046)) - c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: receive requestVote(ELECTION, 18b95dfb-8ef7-4011-a842-0dc8d0be8521, group-9EECFC4A6075, 2, (t:0, i:0))
      2022-01-07 01:24:07,792 [grpc-default-executor-0] INFO  impl.VoteContext (VoteContext.java:log(48)) - c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FOLLOWER: accept ELECTION from 18b95dfb-8ef7-4011-a842-0dc8d0be8521: our priority 0 <= candidate's priority 1
      2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO  server.RaftServer$Division (RaftServerImpl.java:setRole(290)) - c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from  FOLLOWER to FOLLOWER at term 2 for candidate:18b95dfb-8ef7-4011-a842-0dc8d0be8521
      2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO  impl.RoleInfo (RoleInfo.java:shutdownFollowerState(108)) - c7514957-7287-4cfe-8c11-55f780daee7f: shutdown c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
      2022-01-07 01:24:07,793 [grpc-default-executor-0] INFO  impl.RoleInfo (RoleInfo.java:updateAndGet(130)) - c7514957-7287-4cfe-8c11-55f780daee7f: start c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
      2022-01-07 01:24:07,794 [grpc-default-executor-0] INFO  server.RaftServer$Division (RaftServerImpl.java:requestVote(1078)) - c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075 replies to ELECTION vote request: 18b95dfb-8ef7-4011-a842-0dc8d0be8521<-c7514957-7287-4cfe-8c11-55f780daee7f#0:OK-t2. Peer's state: c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075:t2, leader=null, voted=18b95dfb-8ef7-4011-a842-0dc8d0be8521, raftlog=c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-SegmentedRaftLog:OPENED:c-1, conf=-1: [18b95dfb-8ef7-4011-a842-0dc8d0be8521|rpc:10.1.0.7:39307|priority:1, c7514957-7287-4cfe-8c11-55f780daee7f|rpc:10.1.0.7:42065|priority:0, baceac6d-bbfe-42b4-9bde-396690ef56d9|rpc:10.1.0.7:38573|priority:0], old=null
      

      but just after 3ms starts new election due to RPC timeout:

      2022-01-07 01:24:07,795 [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  impl.FollowerState (FollowerState.java:run(134)) - c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState: change to CANDIDATE, lastRpcElapsedTime:5172276476ns, electionTimeout:5172ms
      2022-01-07 01:24:07,796 [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  impl.RoleInfo (RoleInfo.java:shutdownFollowerState(108)) - c7514957-7287-4cfe-8c11-55f780daee7f: shutdown c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState
      2022-01-07 01:24:07,796 [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  server.RaftServer$Division (RaftServerImpl.java:setRole(290)) - c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075: changes role from  FOLLOWER to CANDIDATE at term 2 for changeToCandidate
      2022-01-07 01:24:07,796 [c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-FollowerState] INFO  impl.RoleInfo (RoleInfo.java:updateAndGet(130)) - c7514957-7287-4cfe-8c11-55f780daee7f: start c7514957-7287-4cfe-8c11-55f780daee7f@group-9EECFC4A6075-LeaderElection7
      

      Full log: https://github.com/elek/ozone-build-results/blob/master/2022/01/07/12409/it-client/hadoop-ozone/integration-test/org.apache.hadoop.ozone.client.rpc.TestDeleteWithSlowFollower.txt

      Attachments

        Issue Links

          Activity

            People

              adoroszlai Attila Doroszlai
              adoroszlai Attila Doroszlai
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 40m
                  40m