Description
"Suppose the admin operation timeout is 10 seconds, and that is the maximum timeout we use. How do we then handle the case where:
1) the server we're talking to is in an I/O pause (this is exactly what master_failover-itest tests), and
2) we want to retry and find a new master in the meantime, yet still keep a timeout on the whole operation?
Another idea I had for this is to use two timeouts:
1) a default RPC timeout (the minimum time we wait before retrying the RPC)
2) an overall timeout.
Right now the effective default_admin_operation_timeout is (1)
and select_master_timeout is (2).
This way we keep default_timeout as usual, but now have another timeout we can use to detect slow nodes.
I think on the TS we handle this by relying on the quorum to report the new leader to the master, and I think we're changing that too.
Any thoughts on this?"
See: http://gerrit.sjc.cloudera.com:8080/?l=1399#/c/5483/20/src/kudu/client/meta_cache.cc for discussion
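
For illustration only, the two-timeout idea quoted above can be sketched as a retry loop: each attempt is capped by the per-RPC timeout (1), and the loop as a whole is capped by the overall timeout (2). This is a minimal Python sketch, not Kudu client code. The names call_with_deadline, RpcTimedOut, FakeClock and attempt are hypothetical, the server names are made up, and the 30-second overall budget is an assumed value (only the 10-second figure comes from the description).

```python
import time


class RpcTimedOut(Exception):
    """Hypothetical error: one attempt exceeded its per-RPC timeout."""


def call_with_deadline(attempt, servers, rpc_timeout, overall_timeout,
                       clock=time.monotonic):
    """Call attempt(server, timeout) on each server in turn until one
    succeeds or the overall deadline expires.

    rpc_timeout     -- (1) how long a single RPC may take before we retry
    overall_timeout -- (2) total budget across all retries
    """
    deadline = clock() + overall_timeout
    attempts = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                "overall timeout expired after %d attempts" % attempts)
        server = servers[attempts % len(servers)]
        attempts += 1
        try:
            # Never let a single attempt outlive the overall deadline.
            return attempt(server, min(rpc_timeout, remaining))
        except RpcTimedOut:
            continue  # slow or paused node: try the next candidate


class FakeClock:
    """Deterministic clock so the example runs without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()


def attempt(server, timeout):
    # Assumption for the demo: "master-1" is stuck in an I/O pause and
    # uses up its whole per-RPC timeout; any other server answers at once.
    if server == "master-1":
        clock.now += timeout
        raise RpcTimedOut(server)
    return "leader is " + server


result = call_with_deadline(attempt, ["master-1", "master-2"],
                            rpc_timeout=10, overall_timeout=30, clock=clock)
print(result)  # prints "leader is master-2"
```

In this sketch the paused server costs only one per-RPC timeout (10 seconds of the fake clock) before the loop moves on to the next candidate. If every candidate keeps timing out, the loop raises TimeoutError once the overall budget is spent.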