Under certain circumstances, a client's AlterTable request may get "stuck" in a quorum of masters, causing the client to time out. The flaky test dashboard usually shows a couple instances of this failure in master_failover-itest::TestRenameTableSync.
How does this happen? Let's assume the following:
- Three masters
- Table with one tablet, replicated three times.
- It starts with a complicated master leader election in which master 1 is elected for term 1, master 2 for term 2, then master 1 steps down.
- This means the three tservers have had a chance to register with both master 1 and master 2.
- Now, the tablet's leader replica is elected, causing its next tablet report to include a "dirty" entry to master 2.
- Master 2 is killed, master 1 is reelected for term 3.
- The client issues AlterTable to master 1, but master 1 has no idea who the tablet leader is, so the AlterTablet RPC it issues fails.
- At this point, there are now outstanding master->leader AlterTablet RPCs.
- The tablet leader begins heartbeating to master 1. Its report is incremental and excludes registration information. Master 1 does not ask the tserver to register, because it already got that information in step 1 during the weird master leader election. The report includes the "tablet report" section but lists no tablets, so the master doesn't ask it to send a full report. No AlterTablet RPC is sent.
- At this point, the alter table won't make forward progress until the tablet leader's role changes and an AlterTablet RPC is sent, which may not happen for a long time, or at all (in an integration test).
Some possible solutions:
- When a tserver decides to heartbeat to a new leader master, it should always send a full tablet report.
- When the master crafts an AlterTablet RPC and tries to find the leader, it currently uses "soft state" to do so, state that's only updated in the event of a tablet role change or full tablet report. Instead, we could use "hard state" which, even though the master may be newly elected, should already include up-to-date consensus configuration information.
- Change tservers to always heartbeat to all masters. Doing so means that following a master leader election, the new master has up-to-date "soft state" and can find the leader when issuing an AlterTablet RPC.
- Add a list of tservers to the master's "hard state", so that TSDescriptors (needed when finding the leader tablet) can be instantiated without the need for full tablet reports.
At the moment I'm not sure which approach makes the most sense, or if there's a better one out there.