[CASSANDRA-20059] TCM's Retry.Deadline#retryIndefinitely is dangerous if used with RemoteProcessor as the deadline does not impact message retries - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 5.1
Component/s: Transactional Cluster Metadata
Labels: None
Bug Category: Correctness - API / Semantic Implementation
Severity: Normal
Complexity: Low Hanging Fruit
Discovered By: Code Inspection
Platform: All
Impacts: None
Since Version: 5.1
Source Control Link: https://github.com/apache/cassandra/commit/4f49ca5e29d9c7207654a1f3c4eac9c9f0b84e5e
Test and Documentation Plan:

existing tests


Description

public static Deadline retryIndefinitely(long timeoutNanos, Meter retryMeter)
{
    return new Deadline(Clock.Global.nanoTime() + timeoutNanos,
                        new Retry.Jitter(Integer.MAX_VALUE, DEFAULT_BACKOFF_MS, new Random(), retryMeter))
    {
        @Override
        public boolean reachedMax()
        {
            return false;
        }

        @Override
        public long remainingNanos()
        {
            return timeoutNanos;
        }

        public String toString()
        {
            return String.format("RetryIndefinitely{tries=%d}", currentTries());
        }
    };
}

Sample usage pattern (this example is from Accord, but the same pattern exists in RemoteProcessor.commit):

Promise<LogState> request = new AsyncPromise<>();
List<InetAddressAndPort> candidates = new ArrayList<>(log.metadata().fullCMSMembers());
sendWithCallbackAsync(request,
                      Verb.TCM_RECONSTRUCT_EPOCH_REQ,
                      new ReconstructLogState(lowEpoch, highEpoch, includeSnapshot),
                      new CandidateIterator(candidates),
                      retryPolicy);
return request.get(retryPolicy.remainingNanos(), TimeUnit.NANOSECONDS);

The problem is that the networking retry does not know the caller has given up waiting on the request, so it keeps retrying until it succeeds. This happens because “reachedMax” is what the retry logic checks to decide whether it is safe to run again, and with retryIndefinitely it always returns false, so retries continue even though the deadline has passed.
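The behaviour described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions, not Cassandra's code and not the committed fix: `IndefinitePolicy`, `deadlineReached`, `simulateAttempts`, and the `safetyCap` parameter are hypothetical names invented for this illustration, and a fake clock (a plain `long`) stands in for `Clock.Global.nanoTime()`. It shows that a retry loop gated only on `reachedMax()` never stops on its own, while one that also consults the deadline stops once the timeout has elapsed.

```java
public final class RetryDeadlineSketch
{
    /** Hypothetical stand-in for the policy: as with retryIndefinitely, reachedMax() is always false. */
    static final class IndefinitePolicy
    {
        final long deadlineNanos;
        int tries;

        IndefinitePolicy(long nowNanos, long timeoutNanos)
        {
            this.deadlineNanos = nowNanos + timeoutNanos;
        }

        boolean reachedMax()
        {
            return false;
        }

        // Hypothetical helper: true once the (fake) clock has reached the deadline.
        boolean deadlineReached(long nowNanos)
        {
            return nowNanos >= deadlineNanos;
        }
    }

    /**
     * Simulates a sender whose every attempt fails and costs attemptNanos of fake time.
     * safetyCap exists only so the simulation terminates; a real retry loop has no such cap.
     */
    public static int simulateAttempts(long timeoutNanos, long attemptNanos, boolean checkDeadline, int safetyCap)
    {
        long now = 0; // fake clock, advanced manually
        IndefinitePolicy policy = new IndefinitePolicy(now, timeoutNanos);
        while (policy.tries < safetyCap)
        {
            if (policy.reachedMax())
                break; // never taken for this policy
            if (checkDeadline && policy.deadlineReached(now))
                break; // stop once the caller would have given up
            policy.tries++;
            now += attemptNanos;
        }
        return policy.tries;
    }

    public static void main(String[] args)
    {
        // Gated only on reachedMax(): runs until the artificial cap.
        System.out.println(simulateAttempts(1_000, 100, false, 50)); // prints 50
        // Also gated on the deadline: stops after the timeout has elapsed.
        System.out.println(simulateAttempts(1_000, 100, true, 50));  // prints 10
    }
}
```

In the first call the loop is bounded only by the artificial cap, which mirrors the report: the caller's `request.get(...)` times out, but nothing tells the sender to stop.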

Attachments


- ci_summary-trunk-3fa63cf81ce03bfa45c2b312c1c2846a1d84eee5.html (14/Nov/24 18:28, 34 kB, David Capwell)
- result_details-trunk-3fa63cf81ce03bfa45c2b312c1c2846a1d84eee5.tar.gz (14/Nov/24 18:28, 3.43 MB, David Capwell)
- ci_summary.html (10/Dec/24 23:15, 65 kB, David Capwell)
- result_details.tar.gz (10/Dec/24 23:15, 2.92 MB, David Capwell)

Issue Links

links to

GH: cep-15-accord

GitHub Pull Request #3670

People

Assignee: David Capwell

Reporter: David Capwell

Authors: David Capwell

Reviewers: Alex Petrov, Sam Tunnicliffe

Votes: 0

Watchers: 3

Dates

Created: 06/Nov/24 19:05

Updated: 10/Dec/24 23:28

Resolved: 10/Dec/24 23:28

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

10m