Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
It would be nice to have some impalad-level metrics related to query retries. These would help answer questions such as: How often are queries retried? How often are the retries actually successful? If queries are constantly being retried, then there is probably something wrong with the cluster.
Some possible metrics to add:
- Query retry rate (the rate at which queries are retried)
  - This can be further divided by retry “type” - e.g. what caused the retry
  - Potential categories would be:
    - Queries retried due to failed RPCs
    - Queries retried due to faulty disks
    - Queries retried due to statestore detection of cluster membership changes
- A metric that measures how often query retries are actually successful (e.g. if a query is retried, does the retry succeed, or does it just fail again)
  - This can help users determine if query retries are actually helping, or just adding overhead (e.g. if retries always fail then something is probably wrong)
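
The metrics proposed above could be modeled roughly as follows. This is only an illustrative sketch in Python, not Impala's metrics API: the names `RetryCause` and `QueryRetryMetrics` and all of their methods are hypothetical. It also assumes that "retry rate" means the fraction of submitted queries that were retried and that each query is retried at most once. A per-time-interval rate would be an equally valid reading.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class RetryCause(Enum):
    """Hypothetical retry categories, mirroring the list above."""
    FAILED_RPC = "failed_rpc"
    FAULTY_DISK = "faulty_disk"
    MEMBERSHIP_CHANGE = "membership_change"


@dataclass
class QueryRetryMetrics:
    """Illustrative in-memory counters (assumption: one retry per query)."""
    total_queries: int = 0
    retries_by_cause: Counter = field(default_factory=Counter)
    successful_retries: int = 0

    def record_query(self) -> None:
        # Count every query submitted to the daemon.
        self.total_queries += 1

    def record_retry(self, cause: RetryCause, succeeded: bool) -> None:
        # Count the retry under its cause and note whether it succeeded.
        self.retries_by_cause[cause] += 1
        if succeeded:
            self.successful_retries += 1

    @property
    def total_retries(self) -> int:
        return sum(self.retries_by_cause.values())

    def retry_rate(self) -> float:
        # Fraction of submitted queries that were retried.
        if self.total_queries == 0:
            return 0.0
        return self.total_retries / self.total_queries

    def retry_success_rate(self) -> float:
        # Fraction of retries that ended in success.
        if self.total_retries == 0:
            return 0.0
        return self.successful_retries / self.total_retries


metrics = QueryRetryMetrics()
for _ in range(10):
    metrics.record_query()
metrics.record_retry(RetryCause.FAILED_RPC, succeeded=True)
metrics.record_retry(RetryCause.FAULTY_DISK, succeeded=False)
print(metrics.retry_rate())          # 0.2
print(metrics.retry_success_rate())  # 0.5
```

A low retry success rate combined with a high retry rate would be the signal described above: retries are adding overhead without helping, so the cluster likely has an underlying problem.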