[KUDU-2748] Leader master erroneously tries to tablet copy to a follower master due to race at startup - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.9.0
Fix Version/s: 1.10.0
Component/s: None
Labels: None

Code Review:
https://gerrit.cloudera.org/#/c/12770/

Description

I was investigating KUDU-2734 and ran into a weird situation. The test runs with 3 masters and changes the value of a flag on the masters. For the change to take effect, it restarts the masters. Suppose the masters are labelled A, B, and C. Somewhat rarely (e.g. 8% of the time when run in TSAN with 8 stress threads), the following happens:

1. A and B are restarted successfully. They form a quorum and elect a leader (say A).
2. C is in the process of restarting. The ConsensusService is registered and C is accepting RPCs.
3. A sends C an UpdateConsensus RPC. However, C is still starting up and has not yet initialized the systable, so it responds to the UpdateConsensus call with TABLET_NOT_FOUND, even though the proper response would be SERVICE_UNAVAILABLE.
4. A interprets TABLET_NOT_FOUND to mean that C needs to be copied to, and it tries forever to tablet copy to C. The copies never start because tablet copy is not implemented for masters.
5. C finishes its startup but does not receive UpdateConsensus from A because A is sending StartTabletCopy requests instead. C calls pre-elections endlessly.

This effectively means the cluster is running with two masters until there is a leadership change. This caused the flakiness of KsckRemoteTest.TestClusterWithLocation because C never recognizes the leadership of A, so Ksck master consensus checks fail.

A regular tablet on a tablet server is not vulnerable to this. It's specific to how the master starts up.
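
The sequence in steps 3 to 5 can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kudu code: the classes `FollowerMaster` and `LeaderMaster`, the `distinguish_startup` switch, and the `run` helper are all hypothetical names invented for illustration. Only the status codes and RPC names come from the description above. The model compares a follower that answers TABLET_NOT_FOUND while its systable is uninitialized with one that answers SERVICE_UNAVAILABLE.

```python
from enum import Enum, auto


class Status(Enum):
    OK = auto()
    TABLET_NOT_FOUND = auto()
    SERVICE_UNAVAILABLE = auto()


class FollowerMaster:
    """Hypothetical stand-in for master C while it is restarting."""

    def __init__(self, distinguish_startup):
        self.systable_initialized = False
        # Assumption: when True, C reports "still starting" explicitly.
        self.distinguish_startup = distinguish_startup

    def update_consensus(self):
        if self.systable_initialized:
            return Status.OK
        if self.distinguish_startup:
            return Status.SERVICE_UNAVAILABLE
        # Behavior described in step 3: the systable is not there yet,
        # so the lookup looks like a missing tablet.
        return Status.TABLET_NOT_FOUND


class LeaderMaster:
    """Hypothetical stand-in for leader A."""

    def __init__(self):
        self.needs_tablet_copy = False

    def heartbeat(self, follower):
        if self.needs_tablet_copy:
            # Step 4: tablet copy is not implemented for masters, so the
            # copy never starts and this flag is never cleared.
            return "StartTabletCopy"
        status = follower.update_consensus()
        if status is Status.TABLET_NOT_FOUND:
            self.needs_tablet_copy = True
        # SERVICE_UNAVAILABLE is treated as transient: retry next round.
        return "UpdateConsensus"


def run(distinguish_startup, rounds=5):
    """Return the RPC the leader sends in each round."""
    leader = LeaderMaster()
    follower = FollowerMaster(distinguish_startup)
    sent = []
    for i in range(rounds):
        if i == 1:
            # C finishes startup after the first round.
            follower.systable_initialized = True
        sent.append(leader.heartbeat(follower))
    return sent


if __name__ == "__main__":
    print(run(distinguish_startup=False))
    print(run(distinguish_startup=True))
```

In the first run the leader sends one UpdateConsensus and then only StartTabletCopy, so the follower never hears from its leader again, which matches step 5. In the second run the leader keeps sending UpdateConsensus and the follower catches up once its startup completes.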

People

Assignee: William Berkeley

Reporter: William Berkeley

Votes: 0

Watchers: 2

Dates

Created: 15/Mar/19 21:23

Updated: 21/Mar/19 01:05

Resolved: 21/Mar/19 01:05