Status: In Progress
Affects Version/s: None
Fix Version/s: None
Currently master-stress-test is 5-7% flaky, failing during a create table operation:
Due to the frequent master failovers introduced by the test, CREATE TABLE operations fail because the current leader master, which was likely just started and quickly elected, does not yet know of enough live tablet servers.
In this case the master returns an InvalidArgument status to the client, which the client does not retry. This indicates a real issue that could occur in a production cluster if the leader master were restarted and quickly regained leadership. I'm not sure yet what the right fix is, but I can think of at least a few options:
- Change the return status to ServiceUnavailable. The client will retry up to the timeout. The downside is that in legitimate scenarios where there aren't enough tablet servers, the operation will take the full timeout to fail and will probably carry a less useful error status type. Perhaps we could use a heuristic: if the leader has been active for less than n * heartbeat_interval (where n is a small integer), return ServiceUnavailable; otherwise keep returning InvalidArgument.
- Change master-stress-test to use tables with a replication factor of 1. This makes the race much less likely, although it's still possible. This also doesn't fix the underlying issue.
- Introduce a special case in the table-creating thread of master-stress-test to retry the specific InvalidArgument status. This also doesn't fix the underlying issue.
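
To make the heuristic in the first option concrete, here is a minimal sketch of how the status could be chosen. It is illustrative Python, not the master's actual code: the function name, its parameters, and the default n=3 are assumptions made up for this example, and only the two status names come from the description above.

```python
from enum import Enum


class Status(Enum):
    INVALID_ARGUMENT = "InvalidArgument"        # client does not retry
    SERVICE_UNAVAILABLE = "ServiceUnavailable"  # client retries until its timeout


def create_table_failure_status(leader_elapsed_s, heartbeat_interval_s, n=3):
    """Pick the status to return when too few tablet servers are known alive.

    leader_elapsed_s: seconds since this master became leader (hypothetical input).
    heartbeat_interval_s: tablet server heartbeat interval, in seconds.
    n: small integer multiplier for the grace period; 3 is an arbitrary choice.
    """
    grace_period_s = n * heartbeat_interval_s
    if leader_elapsed_s < grace_period_s:
        # The leader is new, so tablet servers may simply not have
        # heartbeated to it yet. Ask the client to retry.
        return Status.SERVICE_UNAVAILABLE
    # The leader has had time to hear from live tablet servers, so the
    # shortage is probably real. Fail fast with the more specific status.
    return Status.INVALID_ARGUMENT
```

With this shape, a CREATE TABLE that arrives shortly after a failover would be retried by the client, while a cluster that genuinely lacks tablet servers would still fail quickly once the grace period has passed.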