Description
1 background:
Five nodes crashed today due to system OOM. According to the Raft protocol, Kudu should elect a surviving follower as the new leader and resume service, but it did not.
The following error was returned when issuing a query via Impala: "Unable to open scanner: Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 cate1_id=-2147483648, int32 chan_type=-2147483648, int32 county_id=-2147483648, int32 city_id=-2147483648, int32 province_id=-2147483648, 1) failed: timed out after deadline expired: timed out after deadline expired"
2 analysis:
Looking up the tablet by its bucket number showed that the target tablet has only two replicas, which is abnormal. Meanwhile, the tablet server hosting the leader replica had crashed.
The follower cannot be promoted to leader in this situation: the configuration contains only one leader and one follower, so once the leader is dead the follower cannot obtain a majority of votes (it receives only its own vote, 1 of 2).
As a result, the tablet is unavailable even though a follower hosting a replica is still alive.
After restarting kudu-server on the node that hosted the previous leader replica, we observed that the previous leader replica became a follower, the previous follower replica became the leader, and another follower replica was created, giving a 3-replica Raft configuration again.
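The vote arithmetic behind this analysis can be sketched in a few lines. This is only an illustration of the standard Raft majority rule, not Kudu code; the function names `majority_size` and `can_elect_leader` are made up for this example.

```python
def majority_size(num_voters: int) -> int:
    """Smallest number of votes that is a strict majority of num_voters."""
    return num_voters // 2 + 1


def can_elect_leader(num_voters: int, num_live_voters: int) -> bool:
    """A candidate can win only if enough voters are alive to form a majority."""
    return num_live_voters >= majority_size(num_voters)


if __name__ == "__main__":
    # (voters in the Raft configuration, voters still alive after one crash)
    for voters, live in [(3, 2), (2, 1)]:
        print(
            f"voters={voters} live={live} "
            f"majority={majority_size(voters)} "
            f"electable={can_elect_leader(voters, live)}"
        )
```

With three voters, losing one still leaves a majority of two. With two voters, the majority is also two, so losing either replica blocks any election, which matches the behavior observed here.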
3 modifications:
A follower should detect the abnormal situation in which the Raft configuration contains only two replicas (one leader and one follower) and contact the master to correct it.
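A rough sketch of the proposed check follows. Everything in it is hypothetical: `config_needs_repair`, `build_repair_report`, the report fields, and the replica identifiers are illustrative names rather than Kudu APIs, and the target of three replicas is assumed from the 3-replica configuration mentioned in the analysis.

```python
from typing import Optional, Sequence


def config_needs_repair(voter_ids: Sequence[str], desired_replicas: int = 3) -> bool:
    """True when the Raft configuration has fewer voters than desired."""
    return len(voter_ids) < desired_replicas


def build_repair_report(
    tablet_id: str, voter_ids: Sequence[str], desired_replicas: int = 3
) -> Optional[dict]:
    """Return a (hypothetical) report a follower could send to the master,
    or None when the configuration is already fully replicated."""
    if not config_needs_repair(voter_ids, desired_replicas):
        return None
    return {
        "tablet_id": tablet_id,
        "current_voters": list(voter_ids),
        "missing_replicas": desired_replicas - len(voter_ids),
    }


if __name__ == "__main__":
    # Two-replica configuration like the one seen in this incident.
    print(build_repair_report("tablet-453", ["ts-a", "ts-b"]))
    # Healthy three-replica configuration: nothing to report.
    print(build_repair_report("tablet-453", ["ts-a", "ts-b", "ts-c"]))
```

Running such a check while the leader is still healthy matters, because once the leader of a two-replica configuration dies, the remaining follower can no longer win an election on its own.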
4 to do:
What caused the two-replica Raft configuration is still unknown.