Details
- Bug
- Status: Resolved
- Normal
- Resolution: Fixed
- None
- Correctness - Transient Incorrect Response
- Critical
- Normal
- Performance Regression Test
- All
- None
Description
This was found in the cep-15-accord branch (CASSANDRA-18804) by a simple benchmark test:
1) Deploy a 6-node cluster
2) Create a table
3) Launch many Accord transactions in parallel
When Accord gets a transaction, it needs to make sure the table is “managed” by Accord. Accord uses TCM for this bookkeeping; it is just a List<TableId> in ClusterMetadata. We found that we detect that the table isn’t managed, so we try to add it, but we get a reject and the TCM epoch has not moved forward!
Debugging this, it looks like org.apache.cassandra.tcm.RemoteProcessor#commit is the root cause: it only seems to try to catch up if there is a messaging error, and not on a TCM rejection! Given that the caller to TCM is not able to find the epoch to “wait” on, I feel that this is a TCM issue. TCM normally tries to make sure successes and rejects are blocking, but in this one case it appears not to be so.
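To make the suspected behavior concrete, here is a minimal, self-contained sketch of it. It is not the Cassandra code: `Log`, `Node`, `commitAddTable`, and the `catchUpOnReject` flag are all hypothetical names invented for illustration, and the in-memory "log" stands in for the remote TCM commit path. It only shows why a rejected commit that does not trigger a catch-up leaves the caller on a stale epoch, with nothing newer to wait on.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; none of these types exist in Cassandra.
final class CommitCatchUpSketch {

    /** Reply to a commit attempt: whether it was applied, and the log's epoch at that time. */
    static final class Result {
        final boolean success;
        final long epoch;
        Result(boolean success, long epoch) { this.success = success; this.epoch = epoch; }
    }

    /** Hypothetical stand-in for the metadata log: an epoch plus the list of managed tables. */
    static final class Log {
        private long epoch = 0;
        private final List<String> managed = new ArrayList<>();

        synchronized Result commitAddTable(String tableId) {
            if (managed.contains(tableId))
                return new Result(false, epoch); // rejected: another node already added it
            managed.add(tableId);
            epoch++;                              // every applied change advances the epoch
            return new Result(true, epoch);
        }

        synchronized long epoch() { return epoch; }
        synchronized List<String> managed() { return new ArrayList<>(managed); }
    }

    /** Hypothetical node holding a possibly stale local copy of the metadata. */
    static final class Node {
        private final Log log;
        long localEpoch = 0;
        List<String> localManaged = new ArrayList<>();

        Node(Log log) { this.log = log; }

        /** Refresh the local view from the log. */
        void catchUp() {
            synchronized (log) {
                localEpoch = log.epoch();
                localManaged = log.managed();
            }
        }

        /**
         * Try to mark a table as managed. With catchUpOnReject == false the local
         * view is refreshed only on success, which models the behavior described
         * in this report; with true, a rejection also refreshes the local view.
         */
        boolean commitAddTable(String tableId, boolean catchUpOnReject) {
            Result r = log.commitAddTable(tableId);
            if (r.success || catchUpOnReject)
                catchUp();
            return r.success;
        }
    }

    public static void main(String[] args) {
        Log log = new Log();
        Node first = new Node(log);
        Node stale = new Node(log);
        Node fixed = new Node(log);

        first.commitAddTable("t1", false);  // applied, epoch advances
        stale.commitAddTable("t1", false);  // rejected, local view not refreshed
        fixed.commitAddTable("t1", true);   // rejected, local view refreshed

        System.out.println("stale node epoch: " + stale.localEpoch);
        System.out.println("fixed node epoch: " + fixed.localEpoch);
    }
}
```

In this model the node that loses the race still believes the table is unmanaged after its rejected commit unless the rejection path also catches up, which matches the symptom seen in the benchmark.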
Attachments