[CASSANDRA-16616] Harden internode message resource limit accounting against serialization failures - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.0-rc1, 4.0
Component/s: Messaging/Internode
Labels: None
Bug Category: Degradation - Other Exception
Severity: Low
Complexity: Normal
Discovered By: Adhoc Test
Platform: All
Impacts: None
Since Version: 4.0-alpha1
Source Control Link: https://github.com/apache/cassandra/commit/3918a67e67d2de8064dc98beb5166a5491c80b1e
Test and Documentation Plan:

Branch
PR
CircleCI

Description

If the internode messaging exception-recovery code fails and cannot correctly adjust the resource limits for an OutboundConnection, the error affects the other connection types sharing the same OutboundConnections, so any of those connections can hit the "assert using >= 0;" check in org.apache.cassandra.net.ResourceLimits.Concurrent#release.
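
To illustrate how one bad release can surface on a different connection, here is a minimal sketch of a shared limit. SharedLimitSketch and its methods are hypothetical names for illustration, not Cassandra's actual ResourceLimits code, and the sketch throws an exception where the real code uses an assert so that it behaves the same with or without -ea.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for a limit shared by several connections.
final class SharedLimitSketch {
    private final long limit;
    private final AtomicLong using = new AtomicLong();

    SharedLimitSketch(long limit) {
        this.limit = limit;
    }

    // Reserve bytes if the shared limit allows it.
    boolean tryAllocate(long bytes) {
        while (true) {
            long current = using.get();
            long next = current + bytes;
            if (next > limit)
                return false;
            if (using.compareAndSet(current, next))
                return true;
        }
    }

    // Return bytes to the shared pool; fails if the accounting goes negative.
    void release(long bytes) {
        long now = using.addAndGet(-bytes);
        if (now < 0)
            throw new IllegalStateException("using went negative: " + now);
    }

    long using() {
        return using.get();
    }
}
```

Suppose connection A allocates 100 bytes and connection B allocates 50 from the same instance. If a faulty recovery path on A releases 120 bytes instead of 100, the call succeeds and leaves 30 in use. The failure only appears later, when B releases its legitimate 50 bytes and the counter drops below zero.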

While it is possible to modify all of the outbound connection code to re-initialize all of the connections with a correct limit, the effort to test and maintain the recovery code seems too high for something that should "never happen" (except it did once, which is why it needs hardening). The safer option is to kill the JVM and have whatever external monitoring is in place restart the instance in a known good state.
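
The fail-fast option described above might look roughly like the following sketch. FailFastSketch, releaseOrDie, and the injected killJvm action are hypothetical names chosen for illustration; how the actual patch stops the JVM is not stated here.

```java
// Hypothetical sketch: stop the process instead of attempting to repair shared limits.
final class FailFastSketch {
    // Runs the capacity release; if it fails, reports the error and invokes the kill action.
    // Returns true when the release succeeded.
    static boolean releaseOrDie(Runnable releaseCapacity, Runnable killJvm) {
        try {
            releaseCapacity.run();
            return true;
        } catch (Throwable t) {
            // The shared accounting can no longer be trusted, so do not try to continue.
            System.err.println("Resource limit accounting failed, shutting down: " + t);
            killJvm.run();
            return false;
        }
    }
}
```

In a real process the kill action could be something like () -> Runtime.getRuntime().halt(1), leaving the restart to external monitoring. Passing it in as a Runnable keeps the sketch testable without terminating the test JVM.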

Additionally, outbound messages that are dropped because they have expired or cannot be serialized are logged only after the recovery handling logic runs. If the recovery logic throws an exception, the dropped message is never logged for later diagnosis. Logging should take place first, followed by releasing capacity and handling the expiration or serialization failure.
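
The ordering argument can be sketched as follows. DropOrderingSketch, onDropped, and the in-memory log list are hypothetical and stand in for the real logger and capacity-release calls, which are not shown in this ticket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical sketch: record the dropped message before running code that may throw.
final class DropOrderingSketch {
    // Stand-in for a real logger so the ordering can be observed.
    final List<String> log = new ArrayList<>();

    void onDropped(String reason, String messageDescription, long bytes, LongConsumer releaseCapacity) {
        // 1. Log first, so the drop is recorded even if the next step fails.
        log.add("Dropping message (" + reason + "): " + messageDescription);
        // 2. Then release capacity; an exception here no longer hides the drop.
        releaseCapacity.accept(bytes);
    }
}
```

With this ordering, a releaseCapacity callback that throws still leaves one entry in the log describing the dropped message, which is the diagnostic information the description says was being lost.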

Discovered on a branch modified for testing that threw an exception in the Verb.serializeSize method.

People

Assignee: Jon Meredith

Reporter: Jon Meredith

Authors: Jon Meredith

Reviewers: Benjamin Lerer, David Capwell

Votes: 0

Watchers: 4

Dates

Created: 19/Apr/21 21:26

Updated: 16/Mar/22 15:40

Resolved: 20/Apr/21 17:27

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m