Details
- Improvement
- Status: Resolved
- Normal
- Resolution: Fixed
- None
- Operability
- Normal
- All
- None
Description
Based on conversations with benedict and dcapwell, this is the initial set of metrics that seem both feasible to implement and useful for monitoring the health of a cluster performing Accord transactions:
1.) Basic latency metrics for transactions up to the point of COMMIT and rate metrics for preemption, failure, and timeouts at the coordinator.
This has already been implemented and split into read and write-specific metrics. Our position for now is that metrics around preemption should be useful in place of a more difficult-to-define metric around how many transactions are completed via recovery.
2.) Global cache stats/metrics (i.e. aggregated for all command stores)
We could, at some point, build metrics scoped to a specific CommandStore, but they might be awkward in MBean/JMX space, as command stores would have to be identified by ID or key range, and key ranges may change across epochs. (An alternative would be to publish command store-specific stats on demand to a virtual table instead.)
3.) Something like a decaying histogram of the number of dependencies per transaction (or per partial transaction).
If the number of dependencies grows over time, that could be a way for us to detect that contention is increasing. We should be able to hook this up to ProgressLog notifications. Recording for PartialDeps/PartialTxn (which ProgressLog gives us at pre-accept) seems acceptable, given this is a directional metric.
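As a rough illustration of item 3, the sketch below shows one way a decaying histogram of dependency counts could behave. It is not the implementation attached to this ticket: the class name `DecayingDepsHistogram`, the bucket bounds, the half-life, and the explicit `nowSeconds` parameter are all assumptions made for the example, and a real version would more likely reuse the histogram types Cassandra's metrics already provide. The assumption is that `record` would be called from whatever ProgressLog callback exposes the PartialDeps size at pre-accept.

```java
import java.util.Arrays;

/** Hypothetical sketch of a bucketed histogram whose samples lose weight over time. */
final class DecayingDepsHistogram {
    private final long[] upperBounds;   // ascending inclusive upper bound per bucket
    private final double[] weights;     // one extra slot at the end for overflow
    private final double halfLifeSeconds;
    private double lastUpdateSeconds;

    DecayingDepsHistogram(long[] upperBounds, double halfLifeSeconds) {
        this.upperBounds = upperBounds.clone();
        this.weights = new double[upperBounds.length + 1];
        this.halfLifeSeconds = halfLifeSeconds;
    }

    /** Record the dependency count observed for one (partial) transaction. */
    synchronized void record(long dependencyCount, double nowSeconds) {
        decayTo(nowSeconds);
        weights[bucketFor(dependencyCount)] += 1.0;
    }

    /** Upper bound of the bucket containing quantile q, or Long.MAX_VALUE for overflow. */
    synchronized long quantileUpperBound(double q, double nowSeconds) {
        decayTo(nowSeconds);
        double total = 0.0;
        for (double w : weights) total += w;
        if (total == 0.0) return 0L;
        double target = q * total;
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (cumulative >= target)
                return i < upperBounds.length ? upperBounds[i] : Long.MAX_VALUE;
        }
        return Long.MAX_VALUE;
    }

    /** Halve every bucket weight for each half-life elapsed since the last update. */
    private void decayTo(double nowSeconds) {
        if (nowSeconds <= lastUpdateSeconds) return;
        double factor = Math.pow(0.5, (nowSeconds - lastUpdateSeconds) / halfLifeSeconds);
        for (int i = 0; i < weights.length; i++) weights[i] *= factor;
        lastUpdateSeconds = nowSeconds;
    }

    /** Index of the first bucket whose upper bound is >= value. */
    private int bucketFor(long value) {
        int idx = Arrays.binarySearch(upperBounds, value);
        return idx >= 0 ? idx : -(idx + 1);
    }
}
```

Because old samples fade, a rising median or tail in this histogram reflects recent transactions rather than the lifetime of the node, which is what makes it usable as a directional contention signal.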
Issue Links
- is related to: CASSANDRA-18732 Baseline Diagnostic vtables for Accord (Resolved)