KUDU-1847 (project: Kudu)

kudu-tserver should remove itself from the Raft peer config when it encounters tablet corruption


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Invalid
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: cfile, consensus
    • Labels: None

    Description

      Problem found:
      Today one of my tables became unwritable. On kudu-master, I found that only one "FOLLOWER" was left in the Raft config of one of its tablets.
      After searching kudu-tserver.LOG, I found error logs like this:
        I0124 03:29:16.000665 17144 raft_consensus.cc:380] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [term 173317 FOLLOWER]: Starting election with config: opid_index: 572616 local: false peers { permanent_uuid: "69947ffe22e245afb579287073c58dc2" member_type: VOTER last_known_addr { host: "peer_ip" port: 7050 } } peers { permanent_uuid: "1fa77467172b4ed7ba1a0a10e3dd67f8" member_type: VOTER last_known_addr { host: "localhost" port: 7050 } }
        I0124 03:29:16.001211 17144 leader_election.cc:223] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [CANDIDATE]: Term 173317 election: Requesting vote from peer 69947ffe22e245afb579287073c58dc2
        W0124 03:29:16.001549 15548 leader_election.cc:281] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [CANDIDATE]: Term 173317 election: Tablet error from VoteRequest() call to peer 69947ffe22e245afb579287073c58dc2: Illegal state: Tablet not RUNNING: FAILED: Not found: Can't find block: 0000000318394411
        I0124 03:29:16.001845 15548 leader_election.cc:248] T 8870bca7167f46c88099fb3236477530 P 1fa77467172b4ed7ba1a0a10e3dd67f8 [CANDIDATE]: Term 173317 election: Election decided. Result: candidate lost.
      These logs indicate that the current follower of the tablet (f1) started a leader election after its election timeout expired, and found that the tablet replica on the other follower (f2) was not running because its data was corrupted. Because the config contains only two voters, f1 needs f2's vote to reach a majority, so the election failed.
      In the end, only one follower of the tablet was alive.
      I also found that the tablet replica on f2 had been corrupted for several days.
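      The vote arithmetic behind "candidate lost" can be illustrated with a minimal C++ sketch, assuming standard Raft majority voting: a candidate counts its own vote and needs votes from a strict majority of all voters. The names VoteResult, MajoritySize and CandidateWins are hypothetical and do not reflect Kudu's actual consensus code; an error reply such as "Tablet not RUNNING" is modeled simply as a withheld vote.

```cpp
#include <cassert>
#include <vector>

// Hypothetical outcome of one VoteRequest(); not Kudu's real types.
enum class VoteResult { kGranted, kDenied, kError };

// Votes needed to win an election: a strict majority of all voters.
int MajoritySize(int num_voters) {
  assert(num_voters > 0);
  return num_voters / 2 + 1;
}

// `peer_results` holds the outcome of the VoteRequest() sent to every other
// voter in the config. The candidate always votes for itself; an error
// response (for example "Tablet not RUNNING") grants no vote, like a denial.
bool CandidateWins(const std::vector<VoteResult>& peer_results) {
  const int num_voters = static_cast<int>(peer_results.size()) + 1;  // + self
  int granted = 1;  // the candidate's own vote
  for (VoteResult r : peer_results) {
    if (r == VoteResult::kGranted) {
      ++granted;
    }
  }
  return granted >= MajoritySize(num_voters);
}
```

      With the two-voter config from the log, CandidateWins({VoteResult::kError}) returns false: f1 holds one vote but MajoritySize(2) is 2. Under the same model, a three-voter config with one corrupted peer could still elect a leader, since 2 of 3 votes is a majority.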

      Hence, I think this is a bug: we lack logic to remove a peer from the RaftConfig when that peer's tablet data is corrupted.



          People

            Assignee: Unassigned
            Reporter: zhangsong (bruceSz)
            Votes: 0
            Watchers: 3
