Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Won't Fix
Retrying tests
Description
ReplicationQueuesHBaseImpl will abort the server if any of its HBase Table reads or writes fails. We should figure out a reasonable retry limit and pause duration for these operations.
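To make the retry limit and pause duration concrete, here is a minimal sketch of a bounded retry loop with a fixed pause. It is not the ReplicationQueuesHBaseImpl code: `BoundedRetry`, `call`, `maxRetries`, and `pause` are hypothetical names chosen for illustration, and the real HBase client applies its own retry and backoff logic.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only; not part of HBase.
final class BoundedRetry {
    private BoundedRetry() {}

    // Runs op once and, on failure, retries it up to maxRetries more times,
    // sleeping for the given pause between attempts. If every attempt fails,
    // the last exception is rethrown so the caller can decide whether to abort.
    static <T> T call(Callable<T> op, int maxRetries, Duration pause) throws Exception {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not keep retrying once the thread has been interrupted.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(pause.toMillis());
                }
            }
        }
        throw last;
    }
}
```

With this shape, the caller aborts only after the retry budget is exhausted, for example `BoundedRetry.call(() -> doTableWrite(), 240, Duration.ofMillis(100))`, where `doTableWrite` stands in for whichever table operation is being protected.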
As of now the retry and timeout settings look like:
Table initialization:
240 retries
1 minute pause between retries (the Master may not be initialized yet, in which case createTable requests are rejected immediately with PleaseHoldException, so we should sleep in between RPC requests)
1 minute RPC timeouts
Total: At minimum 2 hours of retries
Normal Replication Table operations:
240 retries
100 ms pause between retries (we assume the cluster is in a more stable state and that most exceptions will be RPC timeouts, so I am using the standard RPC pause)
1 minute RPC timeouts
Total: Assuming operations fail because of RPC timeouts, a minimum of 2 hours of retries. If operations fail immediately instead, the pauses alone add up to only 24 seconds.
All of these values are configurable.
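The totals above can be sanity-checked with a small calculation. This is a rough sketch under a flat model, where every retry costs exactly one pause or exactly one RPC timeout and no backoff multiplier is applied. `RetryBudget` and `flatTotal` are hypothetical names, and the real client may schedule retries differently, so these products need not match the quoted totals exactly. Under this model, 240 retries at 100 ms come to 24 seconds, and 240 retries at 1 minute come to 240 minutes.

```java
import java.time.Duration;

// Hypothetical helper, for illustration only; not part of HBase.
final class RetryBudget {
    private RetryBudget() {}

    // Flat model: total time is the retry count multiplied by a fixed per-retry cost.
    static Duration flatTotal(int retries, Duration perRetry) {
        return perRetry.multipliedBy(retries);
    }

    public static void main(String[] args) {
        int retries = 240;
        Duration initPause = Duration.ofMinutes(1);     // table initialization pause
        Duration normalPause = Duration.ofMillis(100);  // normal operation pause
        Duration rpcTimeout = Duration.ofMinutes(1);    // RPC timeout in both cases

        System.out.println("Initialization, pauses only: "
                + flatTotal(retries, initPause).toMinutes() + " minutes");
        System.out.println("Normal operations, pauses only: "
                + flatTotal(retries, normalPause).getSeconds() + " seconds");
        System.out.println("Either case, RPC timeouts only: "
                + flatTotal(retries, rpcTimeout).toMinutes() + " minutes");
    }
}
```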