[CASSANDRA-19260] org.apache.cassandra.tcm.ClusterMetadataService#commit does not catch up when rejected - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 5.1
Component/s: Transactional Cluster Metadata
Labels:
None

Bug Category:
Correctness - Transient Incorrect Response
Severity:
Critical
Complexity:
Normal
Discovered By:
Performance Regression Test
Platform:
All
Impacts:
None
Since Version:
5.1
Source Control Link:
https://github.com/apache/cassandra/commit/3e6a551dbab6ecdc97b99f9ec3118316bfaf1802
Test and Documentation Plan:

Includes a test

Description

This was found in the cep-15-accord branch (CASSANDRA-18804) by a simple benchmark test:

1) Deploy a 6-node cluster
2) Create a table
3) Launch many Accord transactions in parallel

When Accord gets a transaction, it needs to make sure the table is “managed” by Accord. Accord uses TCM for this bookkeeping; it is just a List<TableId> in ClusterMetadata. We found that we detect that the table isn’t managed, so we try to add it, but we get a reject and the TCM epoch has not moved forward!

Debugging this, it looks like org.apache.cassandra.tcm.RemoteProcessor#commit is the root cause, as it only seems to try to catch up if there is a messaging error and not on a TCM rejection! Given that the caller to TCM is not able to find the epoch to “wait” on, I feel that this is a TCM issue: TCM normally tries to make sure success/rejects are blocking, but in this one case it appears not to be so.
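
The behaviour described above can be illustrated with a toy model. This is only a minimal sketch and not Cassandra code: `ToyLog`, `ToyNode`, `Result`, and the `catchUpOnReject` flag are all hypothetical names invented for illustration, and the actual fix (see the Source Control Link) may be structured quite differently. It assumes a rejection carries the epoch at which the log rejected the request, so a node that catches up on reject can observe that another node already added the table.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are Cassandra classes, and it is single-threaded.
final class CatchUpOnRejectSketch {

    /** Outcome of a commit attempt. A rejection still carries the log's current epoch. */
    static final class Result {
        final boolean success;
        final long epoch;

        Result(boolean success, long epoch) {
            this.success = success;
            this.epoch = epoch;
        }
    }

    /** Hypothetical stand-in for the metadata log: an epoch plus the managed-table list. */
    static final class ToyLog {
        private long epoch = 0;
        private final List<String> managedTables = new ArrayList<>();

        Result addManagedTable(String tableId) {
            if (managedTables.contains(tableId))
                return new Result(false, epoch); // rejected: another node already added it
            managedTables.add(tableId);
            epoch++;
            return new Result(true, epoch);
        }

        List<String> snapshot() {
            return new ArrayList<>(managedTables);
        }
    }

    /** Hypothetical stand-in for a node holding a possibly stale local copy of the metadata. */
    static final class ToyNode {
        private final ToyLog log;
        private final boolean catchUpOnReject;
        long localEpoch = 0;
        private List<String> localManagedTables = new ArrayList<>();

        ToyNode(ToyLog log, boolean catchUpOnReject) {
            this.log = log;
            this.catchUpOnReject = catchUpOnReject;
        }

        Result commitAddManagedTable(String tableId) {
            Result result = log.addManagedTable(tableId);
            // Behaviour reported in this ticket: catch up only on success (flag == false).
            // Behaviour the caller expects: also catch up when the commit is rejected.
            if (result.success || catchUpOnReject)
                catchUpTo(result.epoch);
            return result;
        }

        private void catchUpTo(long epoch) {
            if (epoch <= localEpoch)
                return; // already at or past this epoch
            // Toy shortcut: copy the log's state instead of replaying individual entries.
            localManagedTables = log.snapshot();
            localEpoch = epoch;
        }

        boolean isManaged(String tableId) {
            return localManagedTables.contains(tableId);
        }
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        ToyNode winner = new ToyNode(log, true);
        ToyNode stale = new ToyNode(log, false); // no catch-up on reject
        ToyNode fixed = new ToyNode(log, true);  // catches up on reject

        winner.commitAddManagedTable("tbl");
        Result staleResult = stale.commitAddManagedTable("tbl");
        Result fixedResult = fixed.commitAddManagedTable("tbl");

        System.out.println("stale: success=" + staleResult.success
                + " localEpoch=" + stale.localEpoch + " managed=" + stale.isManaged("tbl"));
        System.out.println("fixed: success=" + fixedResult.success
                + " localEpoch=" + fixed.localEpoch + " managed=" + fixed.isManaged("tbl"));
    }
}
```

In this sketch the node without catch-up on reject is left with a local epoch that has not moved and still believes the table is unmanaged, which matches the symptom reported here; the node that catches up on reject sees the table as managed after a single rejected attempt.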

Attachments

ci_summary.html
12/Jan/24 08:24
7 kB
Alex Petrov

Issue Links

links to

Trunk PR

Activity

People

Assignee: Alex Petrov

Reporter: David Capwell

Authors: Alex Petrov

Reviewers: Alex Petrov, Sam Tunnicliffe

Votes: 0

Watchers: 3

Dates

Created: 10/Jan/24 21:49

Updated: 08/Apr/24 11:49

Resolved: 19/Mar/24 15:45