Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
There's a class of issues that can be hard to debug: things that fail semi-silently on the client side. We currently have glog_warning_messages and glog_error_messages, but it would be good to have more granular metrics. A few I have in mind:
- RPC errors, basically any "recv error".
- Server-level errors, such as when the server says TOO BUSY.
- Any kind of insert rejection. Right now we have row key duplicates and memory pressure, but we're missing things like txn_tracker rejections and "not a leader".
- Raft errors, such as dropping a follower because it's lagging too much and we no longer have the WALs around.
There are probably more, but the above would be a good start.
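
To make the idea concrete, here is a minimal sketch of what per-category failure counters could look like. It is standalone C++ and does not use the project's actual metrics framework; `FailureKind`, `FailureCounters`, and all enumerator names are hypothetical, chosen only to mirror the list above.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical failure categories mirroring the list above.
// These are illustrative names, not existing metric names.
enum class FailureKind : std::size_t {
  kRpcRecvError = 0,
  kServerTooBusy,
  kTxnTrackerRejection,
  kNotLeaderRejection,
  kFollowerDroppedMissingWal,
  kNumKinds  // Sentinel: number of categories; must stay last.
};

// One monotonically increasing counter per failure category.
class FailureCounters {
 public:
  // Record one occurrence of the given failure.
  void Increment(FailureKind kind) {
    counts_[Index(kind)].fetch_add(1, std::memory_order_relaxed);
  }

  // Read the current value for a category.
  std::uint64_t Get(FailureKind kind) const {
    return counts_[Index(kind)].load(std::memory_order_relaxed);
  }

 private:
  static constexpr std::size_t Index(FailureKind kind) {
    return static_cast<std::size_t>(kind);
  }

  // Value-initialized, so every counter starts at zero.
  std::array<std::atomic<std::uint64_t>,
             static_cast<std::size_t>(FailureKind::kNumKinds)>
      counts_{};
};
```

Each failure path would call `Increment` with its own category next to the existing log statement, so a spike in, say, "not a leader" rejections shows up as its own number rather than being folded into a generic warning count.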