Details
- Type: Improvement
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
Description
Currently, if a tablet has become unavailable for some reason (e.g., it has lost a majority of replicas), the client will still faithfully retry up to its maximum timeout for a read or write operation. After that timeout, it will sometimes report a "timed out" error rather than something more indicative of the root cause.
The retry-on-unavailability behavior is desirable in the case of transient unavailability (e.g., a node has just failed and a re-election is occurring). But if the tablet has been unavailable for quite some time (e.g., longer than the client timeout, or longer than N heartbeat intervals for some N), then we can assume that it's unlikely to recover within the timeout, and it would be preferable to fail fast with an appropriate exception.
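To make the proposal concrete, here is a minimal sketch of the fail-fast decision in Python. It is not Kudu client code: `TabletState`, `TabletUnavailableError`, `should_fail_fast`, and `check_before_retry` are hypothetical names, and it assumes the client can somehow learn how long the tablet has gone without a valid leader. Taking the smaller of the two thresholds is one possible reading of "longer than the client timeout, or longer than N heartbeat intervals".

```python
from dataclasses import dataclass


class TabletUnavailableError(Exception):
    """Hypothetical error that names the root cause instead of a generic timeout."""


@dataclass
class TabletState:
    # Hypothetical field: seconds since the tablet last had a valid leader,
    # as observed by the client (0.0 means a leader is currently known).
    seconds_without_leader: float


def should_fail_fast(state, client_timeout_s, heartbeat_interval_s, n_heartbeats):
    """Return True if the outage has lasted long enough that retrying is pointless.

    Assumption: the outage counts as long-lived once it exceeds either the
    client timeout or N heartbeat intervals, so the smaller threshold wins.
    """
    threshold_s = min(client_timeout_s, n_heartbeats * heartbeat_interval_s)
    return state.seconds_without_leader > threshold_s


def check_before_retry(state, client_timeout_s, heartbeat_interval_s, n_heartbeats):
    """Raise a descriptive error instead of retrying against a long-dead tablet."""
    if should_fail_fast(state, client_timeout_s, heartbeat_interval_s, n_heartbeats):
        raise TabletUnavailableError(
            "tablet has had no valid leader for %.1fs; failing fast"
            % state.seconds_without_leader
        )
    # Otherwise the outage may be transient (e.g., a re-election in progress),
    # so the caller keeps retrying until its normal timeout.
```

With a 30 s client timeout, a 0.5 s heartbeat interval, and N = 10, a tablet that has been leaderless for 2 s would still be retried, while one that has been leaderless for 60 s would raise `TabletUnavailableError` immediately.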
Issue Links
- relates to KUDU-2287 "Add replica metric tracking time since there was a valid leader" (Resolved)