[CASSANDRA-11363] High Blocked NTR When Connecting - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 2.1.16, 2.2.8, 3.0.10, 3.10
Component/s: Legacy/Coordination
Labels: None
Severity: Normal

Description

While upgrading from 2.1.9 to 2.1.13, we are seeing machine load rise to very high levels (> 120 on an 8-core machine) on the upgraded nodes, and native transport requests show up as blocked in tpstats.

I was able to reproduce this with both CMS and G1GC, and on both JVM 7 and JVM 8.

The issue does not seem to affect the nodes running 2.1.9.

The issue seems to correlate with either the number of connections or the total number of requests being processed at a given time (in our system the latter increases with the former).

Currently there are between 600 and 800 client connections on each machine, and each machine is handling roughly 2000-3000 client requests per second.

Disabling the binary protocol fixes the issue for this node but isn't a viable option cluster-wide.

Here is the output from tpstats:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         8        8387821         0                 0
ReadStage                         0         0         355860         0                 0
RequestResponseStage              0         7        2532457         0                 0
ReadRepairStage                   0         0            150         0                 0
CounterMutationStage             32       104         897560         0                 0
MiscStage                         0         0              0         0                 0
HintedHandoff                     0         0             65         0                 0
GossipStage                       0         0           2338         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
CompactionExecutor                2       190            474         0                 0
ValidationExecutor                0         0              0         0                 0
MigrationStage                    0         0             10         0                 0
AntiEntropyStage                  0         0              0         0                 0
PendingRangeCalculator            0         0            310         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               1        10             94         0                 0
MemtablePostFlush                 1        34            257         0                 0
MemtableReclaimMemory             0         0             94         0                 0
Native-Transport-Requests       128       156         387957        16            278451

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                     0
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
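
For anyone comparing tpstats output across nodes, the check done by eye above (which pools have a non-zero Blocked or All time blocked count) can be scripted. This is only a minimal sketch: it assumes the six-column pool layout shown in this ticket (2.1-era output), and the function names `parse_tpstats_pools` and `blocked_pools` are made up for illustration, not part of Cassandra or nodetool.

```python
# Sketch only: assumes each pool row is "<name> <active> <pending> <completed>
# <blocked> <all time blocked>", as in the tpstats output pasted in this ticket.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         8        8387821         0                 0
CounterMutationStage             32       104         897560         0                 0
Native-Transport-Requests       128       156         387957        16            278451

Message type           Dropped
READ                         0
"""


def parse_tpstats_pools(text):
    """Return one dict per thread-pool row; header and dropped-message rows are skipped."""
    pools = []
    for line in text.splitlines():
        parts = line.split()
        # A pool row is a name followed by exactly five integer columns.
        if len(parts) != 6 or not all(p.isdigit() for p in parts[1:]):
            continue
        name = parts[0]
        active, pending, completed, blocked, all_time_blocked = map(int, parts[1:])
        pools.append({
            "name": name,
            "active": active,
            "pending": pending,
            "completed": completed,
            "blocked": blocked,
            "all_time_blocked": all_time_blocked,
        })
    return pools


def blocked_pools(pools):
    """Names of pools that are blocked now or have ever been blocked."""
    return [p["name"] for p in pools
            if p["blocked"] > 0 or p["all_time_blocked"] > 0]


if __name__ == "__main__":
    print(blocked_pools(parse_tpstats_pools(SAMPLE)))
```

On the excerpt above, only Native-Transport-Requests is reported, which matches the full output: every other pool shows 0 in both blocked columns.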

Attached is the jstack output for both CMS and G1GC.

Flight recordings are here:
https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr

It is interesting to note that while the flight recording was taking place, the load on the machine went back to healthy, and when the flight recording finished the load went back to > 100.

Attachments

File name                      Date              Size      Uploaded by
cassandra-102-cms.stack        16/Mar/16 18:11   1.05 MB   Russell Bradberry
cassandra-102-g1gc.stack       16/Mar/16 18:11   1.01 MB   Russell Bradberry
max_queued_ntr_property.txt    05/Aug/16 17:05   1.0 kB    Romain Hardouin
proxyhistograms.png            06/Jun/18 19:14   55 kB     Dinesh Kumar Attem
tablehistograms.png            06/Jun/18 19:14   52 kB     Dinesh Kumar Attem
tablestats.png                 06/Jun/18 19:14   37 kB     Dinesh Kumar Attem
thread-queue-2.1.txt           18/Jul/16 20:14   0.9 kB    T Jake Luciani
tpstats.png                    06/Jun/18 19:14   128 kB    Dinesh Kumar Attem

People

Assignee: T Jake Luciani
Reporter: Russell Bradberry
Authors: T Jake Luciani
Reviewers: Nate McCall
Votes: 6
Watchers: 30

Dates

Created: 16/Mar/16 18:11
Updated: 16/Apr/19 09:30
Resolved: 20/Sep/16 02:52