[CASSANDRA-18580] Baseline Metrics for Accord Transactions - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 5.x
Component/s: Accord, Observability/JMX, Observability/Metrics
Labels:
None

Epic Link:
CEP-15: Accord Beta
Change Category:
Operability
Complexity:
Normal
Platform:
All
Impacts:
None
Source Control Link:
https://github.com/apache/cassandra/commit/79b3dc07e7f01a315df8db3724db4c1064b8eac9
Test and Documentation Plan:

run tests


Description

Based on conversations with benedict and dcapwell, this is the initial set of metrics that seems both feasible to implement and useful for monitoring the health of a cluster performing Accord transactions:

1.) Basic latency metrics for transactions up to the point of COMMIT and rate metrics for preemption, failure, and timeouts at the coordinator.

This has already been implemented and split into read and write-specific metrics. Our position for now is that metrics around preemption should be useful in place of a more difficult-to-define metric around how many transactions are completed via recovery.
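As a rough illustration of the shape described in item 1, the sketch below keeps separate read and write groups, each with a commit-latency total and counters for preemptions, failures, and timeouts. It is not the code from the linked commit: `CoordinatorMetricsSketch`, `TxnKind`, `Outcome`, and `record` are hypothetical names, and it uses only the JDK where a real implementation would presumably use the project's own timer and meter types.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical coordinator-side metrics, kept separately for reads and writes. */
final class CoordinatorMetricsSketch {
    enum TxnKind { READ, WRITE }
    enum Outcome { COMMITTED, PREEMPTED, FAILED, TIMED_OUT }

    /** Counters for one transaction kind. */
    static final class Group {
        // Total time spent reaching COMMIT; a real metric would be a timer or histogram.
        final LongAdder commitLatencyNanos = new LongAdder();
        final Map<Outcome, LongAdder> outcomes = new EnumMap<>(Outcome.class);

        Group() {
            for (Outcome o : Outcome.values()) outcomes.put(o, new LongAdder());
        }

        long count(Outcome o) {
            return outcomes.get(o).sum();
        }

        double meanCommitLatencyMicros() {
            long committed = count(Outcome.COMMITTED);
            if (committed == 0) return 0.0;
            return TimeUnit.NANOSECONDS.toMicros(commitLatencyNanos.sum()) / (double) committed;
        }
    }

    private final Map<TxnKind, Group> groups = new EnumMap<>(TxnKind.class);

    CoordinatorMetricsSketch() {
        for (TxnKind k : TxnKind.values()) groups.put(k, new Group());
    }

    Group group(TxnKind kind) {
        return groups.get(kind);
    }

    /** Records one coordinated transaction; latency is only tracked up to COMMIT. */
    void record(TxnKind kind, Outcome outcome, long elapsedNanos) {
        Group g = groups.get(kind);
        g.outcomes.get(outcome).increment();
        if (outcome == Outcome.COMMITTED) g.commitLatencyNanos.add(elapsedNanos);
    }

    public static void main(String[] args) {
        CoordinatorMetricsSketch metrics = new CoordinatorMetricsSketch();
        metrics.record(TxnKind.WRITE, Outcome.COMMITTED, TimeUnit.MILLISECONDS.toNanos(2));
        metrics.record(TxnKind.WRITE, Outcome.PREEMPTED, 0);
        System.out.println("write preemptions: " + metrics.group(TxnKind.WRITE).count(Outcome.PREEMPTED));
    }
}
```

Counting preemptions directly, rather than trying to classify transactions as "completed via recovery", keeps the recording site to a single increment at the coordinator.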

2.) Global cache stats/metrics (i.e. aggregated for all command stores)

We could, at some point, build metrics scoped to a specific CommandStore, but they might be awkward in MBean/JMX space, since command stores would have to be identified by ID or key range, and key ranges may change across epochs. (An alternative would be to publish command store-specific stats on demand to a virtual table instead.)
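One way to picture the "global" scope in item 2: each command store updates its own counters, and only the sums are exposed, so nothing in the published metric name depends on a store ID or key range. The sketch below assumes hypothetical names throughout (`GlobalCacheStatsSketch`, `StoreCacheStats`, `register`) and uses plain JDK counters; it does not reflect the actual cache classes in Accord.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical aggregate view over per-command-store cache counters. */
final class GlobalCacheStatsSketch {
    /** Counters owned by a single command store's cache. */
    static final class StoreCacheStats {
        final LongAdder hits = new LongAdder();
        final LongAdder misses = new LongAdder();
    }

    private final List<StoreCacheStats> stores = new CopyOnWriteArrayList<>();

    /** Each command store registers once and then updates only its own counters. */
    StoreCacheStats register() {
        StoreCacheStats stats = new StoreCacheStats();
        stores.add(stats);
        return stats;
    }

    long hits() {
        long total = 0;
        for (StoreCacheStats s : stores) total += s.hits.sum();
        return total;
    }

    long misses() {
        long total = 0;
        for (StoreCacheStats s : stores) total += s.misses.sum();
        return total;
    }

    long requests() {
        return hits() + misses();
    }

    /** Hit rate across all stores, or NaN if nothing has been requested yet. */
    double hitRate() {
        long requests = requests();
        return requests == 0 ? Double.NaN : hits() / (double) requests;
    }

    public static void main(String[] args) {
        GlobalCacheStatsSketch global = new GlobalCacheStatsSketch();
        StoreCacheStats store = global.register();
        store.hits.add(9);
        store.misses.add(1);
        System.out.println("global hit rate: " + global.hitRate());
    }
}
```

The per-store objects are still there, so a virtual table could later read them individually without changing how the counters are updated.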

3.) Something like a decaying histogram of the number of dependencies per transaction (or per partial transaction).

If this is getting worse over time, that would be useful to know and could give us a way to detect that contention is increasing. We should be able to hook this up to ProgressLog notifications. Recording for PartialDeps/PartialTxn (which ProgressLog gives us at pre-accept) seems acceptable, given that this is a directional metric.
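To make "decaying histogram" in item 3 concrete, here is a minimal sketch that buckets dependency counts by powers of two and halves every bucket's weight once per half-life, so recent transactions dominate the reported quantiles. The class name, bucket layout, half-life parameter, and injected clock are all assumptions made for illustration; this is not Cassandra's existing decaying reservoir, and the call site (for example, a ProgressLog notification at pre-accept) is left to the caller.

```java
import java.util.function.LongSupplier;

/** Hypothetical exponentially decaying histogram of dependency counts. */
final class DecayingDepsHistogramSketch {
    // Bucket 0 holds 0, bucket 1 holds 1, bucket 2 holds 2-3, bucket 3 holds 4-7, and so on.
    private final double[] weights = new double[32];
    private final double halfLifeNanos;
    private final LongSupplier nanoClock;
    private long lastDecayNanos;

    DecayingDepsHistogramSketch(long halfLifeNanos, LongSupplier nanoClock) {
        this.halfLifeNanos = halfLifeNanos;
        this.nanoClock = nanoClock;
        this.lastDecayNanos = nanoClock.getAsLong();
    }

    private static int bucketOf(int depCount) {
        return 32 - Integer.numberOfLeadingZeros(Math.max(depCount, 0));
    }

    /** Scales all buckets by 0.5^(elapsed / halfLife) since the last decay. */
    private void decay() {
        long now = nanoClock.getAsLong();
        long elapsed = now - lastDecayNanos;
        if (elapsed <= 0) return;
        double factor = Math.pow(0.5, elapsed / halfLifeNanos);
        for (int i = 0; i < weights.length; i++) weights[i] *= factor;
        lastDecayNanos = now;
    }

    /** Records the number of dependencies seen for one (partial) transaction. */
    synchronized void record(int depCount) {
        decay();
        weights[bucketOf(depCount)] += 1.0;
    }

    /** Current decayed weight of the bucket that contains depCount. */
    synchronized double weight(int depCount) {
        decay();
        return weights[bucketOf(depCount)];
    }

    /** Upper bound of the bucket containing quantile q of the decayed weight. */
    synchronized long quantileUpperBound(double q) {
        decay();
        double total = 0;
        for (double w : weights) total += w;
        if (total == 0) return 0;
        double target = q * total;
        double seen = 0;
        for (int i = 0; i < weights.length; i++) {
            seen += weights[i];
            if (seen >= target) return (1L << i) - 1;
        }
        return Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        DecayingDepsHistogramSketch hist =
            new DecayingDepsHistogramSketch(60_000_000_000L, System::nanoTime);
        hist.record(3);
        hist.record(12);
        System.out.println("approximate p99 deps <= " + hist.quantileUpperBound(0.99));
    }
}
```

Coarse power-of-two buckets fit a directional metric: the goal is to notice the distribution drifting upward, not to report exact dependency counts.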

Issue Links

is related to

CASSANDRA-18732 Baseline Diagnostic vtables for Accord

Resolved

links to

Accord PR

cep-15-accord PR

GitHub Pull Request #2534

People

Assignee: Jacek Lewandowski

Reporter: Caleb Rackliffe

Authors: Jacek Lewandowski

Reviewers: Caleb Rackliffe, David Capwell, Henrik Ingo

Votes: 0

Watchers: 4

Dates

Created: 09/Jun/23 19:22

Updated: 10/Oct/23 10:13

Resolved: 10/Oct/23 10:12

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 8h 40m