Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.10.1
Fix Version/s: None
Component/s: None
Description
We recently found that a few tablets in one of our clusters were unhealthy. The ksck output looks like this:
Tablet Summary
Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
  7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = 7380d797d2ea49e88d71091802fb1c81
  B = d1952499f94a4e6087bee28466fcb09f
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = 08beca5ed4d04003b6979bf8bac378d2
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A B C*           |              |              | Yes
 A             | A B C*           | 5            | -1           | Yes
 B             | A B C            | 5            | -1           | Yes
 C             | A B C* D~        | 5            | 54649        | No

Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' active configs disagree with the leader master's
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
All reported replicas are:
  A = d1952499f94a4e6087bee28466fcb09f
  B = 47af52df1adc47e1903eb097e9c88f2e
  C = 5a8aeadabdd140c29a09dabcae919b31
  D = 14632cdbb0d04279bc772f64e06389f9
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A B* C           |              |              | Yes
 A             | A B* C           | 5            | 5            | Yes
 B             | A B* C D~        | 5            | 96176        | No
 C             | A B* C           | 5            | 5            | Yes

Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' active configs disagree with the leader master's
  a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = a9eaff3cf1ed483aae849549999d649a
  B = f75df4a6b5ce404884313af5f906b392
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A B C*           |              |              | Yes
 A             | A B C*           | 1            | -1           | Yes
 B             | A B C*           | 1            | -1           | Yes
 C             | A B C* D~        | 1            | 2            | No

Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
All reported replicas are:
  A = 47af52df1adc47e1903eb097e9c88f2e
  B = f0f7b2f4b9d344e6929105f48365f38e
  C = f75df4a6b5ce404884313af5f906b392
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A* B C           |              |              | Yes
 A             | A* B C D~        | 1            | 1991         | No
 B             | A* B C           | 1            | 4            | Yes
 C             | A* B C           | 1            | 4            | Yes
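To make the pattern in the ksck output easier to see, here is a rough Python sketch that reads the consensus matrix rows (transcribed by hand from the output above) and reports which replica holds the uncommitted config for each tablet. The helper names `uncommitted_sources` and `pending_holders` are made up for this illustration and are not part of Kudu or ksck.

```python
# UUID that ksck prints next to kudu-ts27 in the output above.
KUDU_TS27 = "47af52df1adc47e1903eb097e9c88f2e"

# (tablet id, {letter: replica UUID}, consensus matrix rows), transcribed by hand.
TABLETS = [
    ("7404240f458f462d92b6588d07583a52",
     {"A": "7380d797d2ea49e88d71091802fb1c81",
      "B": "d1952499f94a4e6087bee28466fcb09f",
      "C": KUDU_TS27,
      "D": "08beca5ed4d04003b6979bf8bac378d2"},
     ["master | A B C* | | | Yes", "A | A B C* | 5 | -1 | Yes",
      "B | A B C | 5 | -1 | Yes", "C | A B C* D~ | 5 | 54649 | No"]),
    ("6d9d3fb034314fa7bee9cfbf602bcdc8",
     {"A": "d1952499f94a4e6087bee28466fcb09f",
      "B": KUDU_TS27,
      "C": "5a8aeadabdd140c29a09dabcae919b31",
      "D": "14632cdbb0d04279bc772f64e06389f9"},
     ["master | A B* C | | | Yes", "A | A B* C | 5 | 5 | Yes",
      "B | A B* C D~ | 5 | 96176 | No", "C | A B* C | 5 | 5 | Yes"]),
    ("bf1ec7d693b94632b099dc0928e76363",
     {"A": "a9eaff3cf1ed483aae849549999d649a",
      "B": "f75df4a6b5ce404884313af5f906b392",
      "C": KUDU_TS27,
      "D": "d1952499f94a4e6087bee28466fcb09f"},
     ["master | A B C* | | | Yes", "A | A B C* | 1 | -1 | Yes",
      "B | A B C* | 1 | -1 | Yes", "C | A B C* D~ | 1 | 2 | No"]),
    ("3190a310857e4c64997adb477131488a",
     {"A": KUDU_TS27,
      "B": "f0f7b2f4b9d344e6929105f48365f38e",
      "C": "f75df4a6b5ce404884313af5f906b392",
      "D": "d1952499f94a4e6087bee28466fcb09f"},
     ["master | A* B C | | | Yes", "A | A* B C D~ | 1 | 1991 | No",
      "B | A* B C | 1 | 4 | Yes", "C | A* B C | 1 | 4 | Yes"]),
]


def uncommitted_sources(rows):
    """Return the config sources whose 'Committed?' column reads 'No'."""
    sources = []
    for row in rows:
        cols = [c.strip() for c in row.split("|")]
        if cols[-1] == "No":
            sources.append(cols[0])
    return sources


def pending_holders(tablets):
    """Map each tablet id to the UUIDs of replicas reporting an uncommitted config."""
    return {
        tablet_id: [letters.get(src, src) for src in uncommitted_sources(rows)]
        for tablet_id, letters, rows in tablets
    }


if __name__ == "__main__":
    for tablet_id, holders in pending_holders(TABLETS).items():
        print(tablet_id, holders)
```

With the four tablets above, every uncommitted config is reported by the replica on kudu-ts27, which is also the leader of each tablet and the only source that lists the extra replica D~.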
These tablets did not recover for a couple of days, until we restarted kudu-ts27.
I found many repeated log messages on kudu-ts27 like these:
I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
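For anyone who wants to check how often a tablet server logs this message, the following is a minimal Python sketch that tallies the repeated "config change operation in progress" lines per tablet and per peer being promoted. The regular expression is written against the two log lines quoted above only, and the function name `count_blocked_promotions` is made up for this illustration; other Kudu versions may word the message differently.

```python
import re
from collections import Counter

# Matches the message format of the two log lines quoted in this report.
BLOCKED_PROMOTION = re.compile(
    r"T (?P<tablet>[0-9a-f]{32}) P (?P<peer>[0-9a-f]{32}) "
    r"\[term (?P<term>\d+) LEADER\]: "
    r"attempt to promote peer (?P<target>[0-9a-f]{32}): "
    r"there is already a config change operation in progress"
)


def count_blocked_promotions(lines):
    """Count blocked promotion attempts, keyed by (tablet id, peer being promoted)."""
    counts = Counter()
    for line in lines:
        match = BLOCKED_PROMOTION.search(line)
        if match:
            counts[(match.group("tablet"), match.group("target"))] += 1
    return counts


if __name__ == "__main__":
    # The two log lines from this report, used as sample input.
    sample = [
        "I0314 04:38:41.511279 65731 raft_consensus.cc:937] "
        "T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e "
        "[term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: "
        "there is already a config change operation in progress. "
        "Unable to promote follower until it completes. Doing nothing.",
        "I0314 04:38:41.751009 65453 raft_consensus.cc:937] "
        "T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e "
        "[term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: "
        "there is already a config change operation in progress. "
        "Unable to promote follower until it completes. Doing nothing.",
    ]
    for (tablet, target), n in count_blocked_promotions(sample).items():
        print(f"tablet {tablet}: {n} blocked attempt(s) to promote {target}")
```

In practice the input would be the lines of the tablet server's INFO log rather than the hard-coded sample; a count that keeps growing for the same tablet and peer suggests the pending config change is not making progress.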
There seem to be some RaftConfig change operations on kudu-ts27 that, for some reason, cannot complete.