[CASSANDRA-17424] Performance and Semantic Concerns w/ Metrics for Local vs. Remote Requests in StorageProxy

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.1-alpha1, 4.1
Component/s: Observability/Metrics
Labels: None
Bug Category: Correctness - API / Semantic Implementation
Severity: Normal
Complexity: Normal
Discovered By: Code Inspection
Platform: All
Impacts: None
Since Version: 4.1
Source Control Link: https://github.com/apache/cassandra/commit/57ab3afcf16970047d3df4656241cf0705e94bee
Test and Documentation Plan: n/a

Description

In CASSANDRA-10023, we added two new metrics to both ClientRequestMetrics and ClientWriteRequestMetrics to represent requests where the driver either does or does not make a correct token-aware choice of coordinator. (Auditing driver behavior is listed as the primary goal of that Jira.)

There are, however, a few concerns we should address before this releases in 4.1:

1.) With paging enabled and a LIMIT < fetch size, IN queries can hit fetchRows() multiple times, so the number of local + remote requests isn’t the same as the number of queries marked in ClientRequestMetrics in readRegular().

2.) IN queries will potentially mark a bunch of “remote” requests even if one key in the IN set is “local”.

3.) Something similar happens with mutations. If StorageProxy#mutate() receives multiple mutations, we’ll mark against one of these new metrics in ClientWriteRequestMetrics for each mutation, while ClientWriteRequestMetrics will only register the actual client request once.

For cases 2 and 3, we may mark both local and remote requests for the same overall client request, which introduces ambiguity if these are intended to help audit driver coordinator selection behavior. There are a few options:

a.) We can accept the ambiguity, but then we haven’t really accomplished the goal of CASSANDRA-10023 for some request types.

b.) We can simply not record any of these metrics for requests where multiple partitions/tokens are involved.

c.) We can be lenient, marking requests as “local” if any of the partitions/tokens involved in the client request are, in fact, local.

“c” feels like the option that preserves as much functionality as possible without being ambiguous, but problem #2 above is still tricky, given the way IN and GROUP BY queries behave w/ paging. (Perhaps ambiguity in that case is acceptable?)
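
To make option “c” concrete, here is a minimal sketch of the lenient rule, marking exactly one of the two meters per client request. Everything in it is an assumption for illustration: `LenientLocalityMarker`, its `LongAdder` counters, and the use of `InetSocketAddress` are hypothetical stand-ins, not the actual metrics classes or the endpoint type used in StorageProxy.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the local/remote request meters (not Cassandra code). */
final class LenientLocalityMarker {
    final LongAdder localRequests = new LongAdder();
    final LongAdder remoteRequests = new LongAdder();
    private final InetSocketAddress self;

    LenientLocalityMarker(InetSocketAddress self) {
        this.self = self;
    }

    /** Option (c): the request is "local" if this node replicates at least one partition. */
    boolean isLocal(List<Set<InetSocketAddress>> replicasPerPartition) {
        for (Set<InetSocketAddress> replicas : replicasPerPartition) {
            if (replicas.contains(self)) {
                return true;
            }
        }
        return false;
    }

    /** Marks one meter per client request, however many partitions it touches. */
    void mark(List<Set<InetSocketAddress>> replicasPerPartition) {
        if (isLocal(replicasPerPartition)) {
            localRequests.increment();
        } else {
            remoteRequests.increment();
        }
    }

    public static void main(String[] args) {
        InetSocketAddress self = InetSocketAddress.createUnresolved("10.0.0.1", 7000);
        InetSocketAddress n2 = InetSocketAddress.createUnresolved("10.0.0.2", 7000);
        InetSocketAddress n3 = InetSocketAddress.createUnresolved("10.0.0.3", 7000);

        LenientLocalityMarker marker = new LenientLocalityMarker(self);
        // A two-partition request where only the second partition is local.
        marker.mark(List.of(Set.of(n2, n3), Set.of(self, n2)));
        // A single-partition request with no local replica.
        marker.mark(List.of(Set.of(n2, n3)));

        System.out.println("local=" + marker.localRequests.sum()
                + " remote=" + marker.remoteRequests.sum());
    }
}
```

With this rule, local + remote marks add up to the number of client requests for multi-partition reads and multi-mutation writes. It does not by itself address problem #1, where paging can drive several fetches for one query.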

In addition to the general ambiguity around the above…

4.) There is excessive object creation involved (on a hot path) in our determination of whether a request is local or remote. We should be able to mitigate this by getting rid of AbstractReadExecutor#getContactedReplicas() and relying on ReplicaPlan#lookup() rather than creating strings. (Even for writes, we should be able to push down marking into performWrite(), where the write ReplicaPlan is already available.)
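
As a rough illustration of point 4, the sketch below contrasts a string-based locality check with a direct endpoint lookup. `SketchPlan` and its `lookup()` are hypothetical stand-ins and only assume that a plan can be queried by endpoint; they are not the real ReplicaPlan API or its signatures.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LocalityCheckSketch {
    /** Hypothetical stand-in for a replica plan that can be queried by endpoint. */
    static final class SketchPlan {
        private final Set<InetSocketAddress> contacts;

        SketchPlan(List<InetSocketAddress> contacts) {
            this.contacts = new HashSet<>(contacts);
        }

        /** Returns the endpoint if it is part of the plan, otherwise null. */
        InetSocketAddress lookup(InetSocketAddress endpoint) {
            return contacts.contains(endpoint) ? endpoint : null;
        }

        Set<InetSocketAddress> contacts() {
            return contacts;
        }
    }

    /** Allocation-heavy variant: builds a list of strings on every request. */
    static boolean isLocalViaStrings(SketchPlan plan, InetSocketAddress self) {
        List<String> contacted = new ArrayList<>();
        for (InetSocketAddress endpoint : plan.contacts()) {
            contacted.add(endpoint.toString());
        }
        return contacted.contains(self.toString());
    }

    /** Lookup variant: one hash probe, no per-request collections or strings. */
    static boolean isLocalViaLookup(SketchPlan plan, InetSocketAddress self) {
        return plan.lookup(self) != null;
    }

    public static void main(String[] args) {
        InetSocketAddress self = InetSocketAddress.createUnresolved("10.0.0.1", 7000);
        InetSocketAddress other = InetSocketAddress.createUnresolved("10.0.0.2", 7000);
        SketchPlan plan = new SketchPlan(List.of(self, other));

        System.out.println(isLocalViaStrings(plan, self) == isLocalViaLookup(plan, self));
    }
}
```

Both variants give the same answer; the second simply avoids building a throwaway collection and strings on the hot path.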

Issue Links

is caused by

CASSANDRA-10023 Emit a metric for number of local read and write calls

Resolved

People

Assignee: Caleb Rackliffe

Reporter: Caleb Rackliffe

Authors: Caleb Rackliffe

Reviewers: Jon Meredith, Marcus Eriksson

Votes: 0

Watchers: 4

Dates

Created: 08/Mar/22 22:49

Updated: 27/May/22 19:24

Resolved: 29/Mar/22 16:59

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 20m