Cassandra / CASSANDRA-13365

Nodes entering GC loop, does not recover


Details

    • Type: Bug
    • Status: Awaiting Feedback
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Legacy/Core
    • Labels: None
    • Environment: 34-node cluster over 4 DCs
      Linux CentOS 7.2 x86
      Mix of 64GB/128GB RAM / node
      Mix of 32/40 hardware threads / node, Xeon ~2.4GHz
      High read volume, low write volume, occasional sstable bulk loading

    • Severity: Normal

    Description

      Over the last week we've been observing two related problems affecting our Cassandra cluster.

      Problem 1: One to a few nodes per DC enter a GC loop and do not recover

      Checking the heap usage stats, we see a sudden jump of 1-3GB. Some nodes recover, but some don't and log this:

      2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)  13G->11G(14G), 29.4127307 secs]
      2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure)  13G->12G(14G), 28.1561881 secs]
      2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure)  13G->13G(14G), 27.7019501 secs]
      2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure)  13G->13G(14G), 27.1372267 secs]
      2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure)  13G->13G(14G), 27.0703975 secs]
      2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure)  13G->13G(14G), 27.3187768 secs]
      2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure)  13G->13G(14G), 26.9493405 secs]
      2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure)  13G->13G(14G), 27.5222085 secs]
      2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure)  13G->13G(14G), 27.1769581 secs]
      2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure)  13G->13G(14G), 27.4639031 secs]
      2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure)  13G->13G(14G), 30.3480523 secs]
      2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure)  13G->13G(14G), 27.8220513 secs]
      2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure)  13G->13G(14G), 27.4691271 secs]
      2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure)  13G->13G(14G), 27.0275733 secs]
      2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure)  13G->13G(14G), 27.1902627 secs]
      2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure)  13G->13G(14G), 27.8872850 secs]
      2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure)  13G->13G(14G), 27.1609706 secs]
      2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure)  13G->13G(14G), 27.3635138 secs]
      2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure)  13G->13G(14G), 27.4143510 secs]
      2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure)  13G->13G(14G), 27.8595986 secs]
      2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure)  13G->13G(14G), 27.5242543 secs]
      2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure)  13G->13G(14G), 30.8930130 secs]
      2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure)  13G->13G(14G), 27.6523013 secs]
      2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure)  13G->13G(14G), 27.3030198 secs]
      2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure)  13G->13G(14G), 27.1594154 secs]
      2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure)  13G->13G(14G), 27.3766949 secs]
      2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure)  13G->13G(14G), 27.5756770 secs]
      2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure)  13G->13G(14G), 27.5541813 secs]
      2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure)  13G->13G(14G), 33.7510103 secs]
      2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure)  13G->13G(14G), 27.4856611 secs]
      2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure)  13G->13G(14G), 28.1269335 secs]
      2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure)  13G->13G(14G), 27.7830448 secs]
      2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure)  13G->13G(14G), 27.3527326 secs]
      2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure)  13G->13G(14G), 27.5828941 secs]
      2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure)  13G->13G(14G), 27.9303030 secs]
      2017-03-21T11:40:05.265-0400: 55121.828: [Full GC (Allocation Failure)  13G->13G(14G), 36.9902867 secs]
      2017-03-21T11:40:42.400-0400: 55158.963: [Full GC (Allocation Failure)  13G->13G(14G), 27.6835744 secs]
      2017-03-21T11:41:10.529-0400: 55187.091: [Full GC (Allocation Failure)  13G->13G(14G), 27.1899555 secs]
      2017-03-21T11:41:38.018-0400: 55214.581: [Full GC (Allocation Failure)  13G->13G(14G), 27.7309706 secs]
      2017-03-21T11:42:06.062-0400: 55242.624: [Full GC (Allocation Failure)  13G->13G(14G), 29.2103234 secs]
      2017-03-21T11:42:35.579-0400: 55272.141: [Full GC (Allocation Failure)  13G->13G(14G), 27.9902741 secs]
      2017-03-21T11:43:03.955-0400: 55300.518: [Full GC (Allocation Failure)  13G->13G(14G), 28.0764545 secs]
      2017-03-21T11:43:32.186-0400: 55328.748: [Full GC (Allocation Failure)  13G->13G(14G), 28.3913296 secs]
      2017-03-21T11:44:00.810-0400: 55357.373: [Full GC (Allocation Failure)  13G->13G(14G), 27.0945131 secs]
      2017-03-21T11:44:28.343-0400: 55384.905: [Full GC (Allocation Failure)  13G->13G(14G), 27.3357763 secs]
      2017-03-21T11:44:55.909-0400: 55412.471: [Full GC (Allocation Failure)  13G->13G(14G), 27.9216626 secs]
      2017-03-21T11:45:24.070-0400: 55440.632: [Full GC (Allocation Failure)  13G->13G(14G), 29.8189953 secs]
      2017-03-21T11:45:54.092-0400: 55470.654: [Full GC (Allocation Failure)  13G->13G(14G), 27.9992085 secs]
      2017-03-21T11:46:22.219-0400: 55498.782: [Full GC (Allocation Failure)  13G->13G(14G), 27.4622100 secs]
      2017-03-21T11:46:49.983-0400: 55526.545: [Full GC (Allocation Failure)  13G->13G(14G), 27.9159143 secs]
      2017-03-21T11:47:18.029-0400: 55554.592: [Full GC (Allocation Failure)  13G->13G(14G), 28.5207919 secs]
      2017-03-21T11:47:46.938-0400: 55583.500: [Full GC (Allocation Failure)  13G->13G(14G), 27.0355291 secs]
      2017-03-21T11:48:14.235-0400: 55610.798: [Full GC (Allocation Failure)  13G->13G(14G), 35.6284865 secs]
      2017-03-21T11:48:50.013-0400: 55646.575: [Full GC (Allocation Failure)  13G->13G(14G), 27.3299061 secs]
      2017-03-21T11:49:17.592-0400: 55674.154: [Full GC (Allocation Failure)  13G->13G(14G), 27.2544348 secs]
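
      A loop like the one above can be flagged mechanically from the GC log. A rough Python sketch of such a check (the regex matches the Full GC lines shown above; the streak length and "unproductive GC" slack are illustrative thresholds):

      import re
      import sys

      # Matches lines like:
      # 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)  13G->11G(14G), 29.4127307 secs]
      FULL_GC = re.compile(r'\[Full GC \(Allocation Failure\)\s+\d+G->(\d+)G\((\d+)G\), ([\d.]+) secs\]')

      def detect_gc_loop(path, min_consecutive=5, slack_gb=1):
          """Warn when several consecutive Full GCs leave the heap within slack_gb of capacity."""
          streak = 0
          for line in open(path):
              m = FULL_GC.search(line)
              if not m:
                  continue
              after_gb, capacity_gb, pause_s = int(m.group(1)), int(m.group(2)), float(m.group(3))
              if capacity_gb - after_gb <= slack_gb:   # GC reclaimed almost nothing
                  streak += 1
                  if streak == min_consecutive:
                      print("GC loop: %d unproductive Full GCs in a row, last pause %.1fs" % (streak, pause_s))
              else:
                  streak = 0

      if __name__ == '__main__':
          detect_gc_loop(sys.argv[1])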
      

      When a node enters the above state, it becomes useless within 1-5 minutes:

      • Client connections die
      • New connections time out
      • CPU usage drops - all threads are idle except for one thread near 100% CPU ("VM Thread" os_prio=0 tid=0x00007ffaf8209bf0 nid=0x2d53 runnable), doing the above GC in a loop
      • JMX becomes unqueryable
      • A regular kill signal is ignored (needs kill -9)

      I've managed to extract a full heap dump of a node while in this state, but it's 20GB and I have not been able to find a tool that can analyze it.

      I've extracted a histogram of heap objects. From a healthy node:

       num     #instances         #bytes  class name
      ----------------------------------------------
         1:      12688007     1116544616  io.netty.buffer.PooledUnsafeDirectByteBuf
         2:      24938940      798046080  io.netty.util.Recycler$DefaultHandle
         3:      11457548      733283072  io.netty.channel.ChannelOutboundBuffer$Entry
         4:       9582138      459942624  java.nio.HeapByteBuffer
         5:         16974      393432416  [I
         6:       2014719      311356992  [B
         7:       4982167      181324848  [Ljava.lang.Object;
         8:          8531      118782704  [Lio.netty.util.Recycler$DefaultHandle;
         9:       2827662      113106480  org.apache.cassandra.db.rows.BufferCell
        10:       2737816       87610112  org.apache.cassandra.db.rows.BTreeRow
        11:       2740110       70228240  [Ljava.nio.ByteBuffer;
        12:       2738081       65713944  org.apache.cassandra.db.BufferClustering
        13:       1763956       56446592  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
        14:       1763736       56439552  com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
        15:       2205825       52939800  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
        16:       1617026       51744832  java.util.concurrent.ConcurrentHashMap$Node
        17:       1769934       42478416  org.apache.cassandra.cache.KeyCacheKey
      

      From an unhealthy node:

       num     #instances         #bytes  class name
      ----------------------------------------------
         1:     103726026     4978849248  java.nio.HeapByteBuffer
         2:      95646048     2832233872  [B
         3:      28076465     1207916952  [Ljava.lang.Object;
         4:      12803448     1126703424  io.netty.buffer.PooledUnsafeDirectByteBuf
         5:      25054634      801748288  io.netty.util.Recycler$DefaultHandle
         6:      11457548      733283072  io.netty.channel.ChannelOutboundBuffer$Entry
         7:      24725380      593409120  java.util.ArrayList
         8:       5808290      185865280  java.util.RegularEnumSet
         9:       2870889      160769784  org.apache.cassandra.transport.messages.ResultMessage$Rows
        10:       2937336      140992128  io.netty.buffer.SlicedAbstractByteBuf
        11:          8854      118773984  [Lio.netty.util.Recycler$DefaultHandle;
        12:       2830805      113232200  org.apache.cassandra.db.rows.BufferCell
        13:       2937336       93994752  org.apache.cassandra.transport.Frame$Header
        14:       2870928       91869696  org.apache.cassandra.cql3.ResultSet$ResultMetadata
        15:       2728627       87316064  org.apache.cassandra.db.rows.BTreeRow
        16:       2947033       70728792  java.util.concurrent.ConcurrentLinkedQueue$Node
        17:       2937336       70496064  org.apache.cassandra.transport.Frame
      

      The increase is clear in these top 3 object types:

         1:     103726026     4978849248  java.nio.HeapByteBuffer
         2:      95646048     2832233872  [B
         3:      28076465     1207916952  [Ljava.lang.Object;
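
      The comparison above falls out of simply diffing the two histograms by class name. A rough Python sketch, assuming each jmap-style histogram above was saved to a file with the column layout shown:

      import re
      import sys

      # A histogram row looks like:
      #    1:     103726026     4978849248  java.nio.HeapByteBuffer
      ROW = re.compile(r'^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)')

      def load(path):
          """Map class name -> total bytes from a saved histogram file."""
          sizes = {}
          for line in open(path):
              m = ROW.match(line)
              if m:
                  sizes[m.group(3)] = int(m.group(2))
          return sizes

      def top_growth(healthy_path, unhealthy_path, top=10):
          """Print the classes with the largest byte growth between the two histograms."""
          healthy, unhealthy = load(healthy_path), load(unhealthy_path)
          deltas = sorted(((unhealthy[k] - healthy.get(k, 0), k) for k in unhealthy), reverse=True)
          for delta_bytes, klass in deltas[:top]:
              print("%10d MB  %s" % (delta_bytes // (1024 * 1024), klass))

      if __name__ == '__main__':
          top_growth(sys.argv[1], sys.argv[2])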
      

      Other related observations:

      Client socket states: essentially the JVM has stopped working; the recv-queue is piling up kernel-side but is not being consumed app-side. Worth noting that the peers (clients) have given up and closed the connections on their end, but they still appear ESTABlished on Cassandra's side, since it hasn't drained the read buffers to EOF/error or completed the close:

      ESTAB      558612 0      172.30.72.133:9042               172.30.72.145:11418
      ESTAB      564172 0      172.30.72.133:9042               172.30.72.145:12420
      ESTAB      545284 0      172.30.72.133:9042               172.30.72.145:18212
      ESTAB      565468 0      172.30.72.133:9042               172.30.72.145:18346
      ESTAB      525884 0      172.30.72.133:9042               172.30.72.145:18352
      ESTAB      532324 0      172.30.72.133:9042               172.30.72.145:28896
      ESTAB      517592 0      172.30.72.133:9042               172.30.72.145:34504
      ESTAB      553392 0      172.30.72.133:9042               172.30.72.145:36926
      ESTAB      581952 0      172.30.72.133:9042               172.30.72.145:40684
      ESTAB      543668 0      172.30.72.133:9042               172.30.72.145:42282
      ESTAB      511860 0      172.30.72.133:9042               172.30.72.145:47350
      ESTAB      590380 0      172.30.72.133:9042               172.30.72.145:47566
      ESTAB      583120 0      172.30.72.133:9042               172.30.72.145:50386
      ESTAB      536772 0      172.30.72.133:9042               172.30.72.145:52656
      ESTAB      31     0      172.30.72.133:9042               172.30.72.145:52946
      ESTAB      532140 0      172.30.72.133:9042               172.30.72.145:53618
      ESTAB      14397  0      172.30.72.133:9042               172.30.72.150:10016
      ESTAB      6444   0      172.30.72.133:9042               172.30.72.150:12176
      ESTAB      7568   0      172.30.72.133:9042               172.30.72.150:12934
      ESTAB      139479 0      172.30.72.133:9042               172.30.72.150:13186
      ESTAB      6896   0      172.30.72.133:9042               172.30.72.150:17316
      ESTAB      7808   0      172.30.72.133:9042               172.30.72.150:17448
      ESTAB      8712   0      172.30.72.133:9042               172.30.72.150:17748
      ESTAB      6944   0      172.30.72.133:9042               172.30.72.150:18472
      ESTAB      136700 0      172.30.72.133:9042               172.30.72.150:19860
      ESTAB      16260  0      172.30.72.133:9042               172.30.72.150:20696
      ESTAB      8943   0      172.30.72.133:9042               172.30.72.150:21208
      ESTAB      132877 0      172.30.72.133:9042               172.30.72.150:21420
      ESTAB      6936   0      172.30.72.133:9042               172.30.72.150:25440
      ESTAB      92     0      172.30.72.133:9042               172.30.72.150:25778
      ESTAB      137312 0      172.30.72.133:9042               172.30.72.150:28258
      ESTAB      40     0      172.30.72.133:9042               172.30.72.150:28348
      ESTAB      6940   0      172.30.72.133:9042               172.30.72.150:29718
      ESTAB      6416   0      172.30.72.133:9042               172.30.72.150:29858
      ESTAB      6580   0      172.30.72.133:9042               172.30.72.150:30518
      ESTAB      4202   0      172.30.72.133:9042               172.30.72.150:30948
      ESTAB      6060   0      172.30.72.133:9042               172.30.72.150:35276
      ESTAB      0      0      172.30.72.133:9042               172.30.72.150:40114
      ESTAB      134303 0      172.30.72.133:9042               172.30.72.150:42212
      ESTAB      396    0      172.30.72.133:9042               172.30.72.150:44848
      ESTAB      136450 0      172.30.72.133:9042               172.30.72.150:47742
      ESTAB      8318   0      172.30.72.133:9042               172.30.72.150:51054
      ESTAB      11976  0      172.30.72.133:9042               172.30.72.150:54560
      ESTAB      136193 0      172.30.72.133:9042               172.30.72.150:56094
      ESTAB      138737 0      172.30.72.133:9042               172.30.72.150:60382
      ESTAB      7996   0      172.30.72.133:9042               172.30.72.150:60932
      ESTAB      137235 0      172.30.72.133:9042               172.30.72.150:61568
      ESTAB      8184   0      172.30.72.133:9042               172.30.72.150:63054
      ESTAB      985276 0      172.30.72.133:9042               172.30.72.175:23494
      ESTAB      601528 0      172.30.72.133:9042               172.30.72.175:24140
      ESTAB      557844 0      172.30.72.133:9042               172.30.72.175:42248
      ESTAB      469344 0      172.30.72.133:9042               172.30.72.175:59704
      ESTAB      507160 0      172.30.72.133:9042               172.30.72.185:11958
      ESTAB      449196 0      172.30.72.133:9042               172.30.72.185:15458
      ESTAB      487796 0      172.30.72.133:9042               172.30.72.185:16458
      ESTAB      441132 0      172.30.72.133:9042               172.30.72.185:23194
      ESTAB      486664 0      172.30.72.133:9042               172.30.72.185:25276
      ESTAB      554780 0      172.30.72.133:9042               172.30.72.185:29102
      ESTAB      496248 0      172.30.72.133:9042               172.30.72.185:29428
      ESTAB      420048 0      172.30.72.133:9042               172.30.72.185:39650
      ESTAB      498704 0      172.30.72.133:9042               172.30.72.185:46424
      ESTAB      506488 0      172.30.72.133:9042               172.30.72.185:49722
      ESTAB      509416 0      172.30.72.133:9042               172.30.72.185:55838
      ESTAB      511956 0      172.30.72.133:9042               172.30.72.185:57670
      ESTAB      496720 0      172.30.72.133:9042               172.30.72.185:59884
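
      To put a number on that backlog, the kernel-side Recv-Q can be summed across the established native-port connections. A rough Python sketch driving ss (it assumes the default column layout shown above):

      import subprocess

      def native_port_backlog(port=9042):
          """Sum the kernel-side Recv-Q of established client connections on the native port."""
          out = subprocess.check_output(['ss', '-tn', 'sport', '=', ':%d' % port]).decode()
          conns, backlog = 0, 0
          for line in out.splitlines():
              fields = line.split()
              if fields and fields[0] == 'ESTAB':   # columns: State, Recv-Q, Send-Q, Local, Peer
                  backlog += int(fields[1])
                  conns += 1
          return conns, backlog

      if __name__ == '__main__':
          conns, backlog = native_port_backlog()
          print("%d established connections, %.1f MB unread in kernel receive buffers" % (conns, backlog / 1048576.0))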
      

      perf top during this state:

        19.73%  libjvm.so                                              [.] InstanceKlass::oop_follow_contents
        14.05%  libjvm.so                                              [.] ObjArrayKlass::oop_follow_contents
         7.83%  libjvm.so                                              [.] MarkSweep::follow_stack
         6.38%  libjvm.so                                              [.] HeapRegion::block_size
         5.41%  libjvm.so                                              [.] OtherRegionsTable::add_reference
         5.12%  libjvm.so                                              [.] UpdateRSOopClosure::do_oop
         4.77%  libjvm.so                                              [.] HeapRegion::block_is_obj
      

      Frequency: this affects nodes in our 2 busier DCs (the other 2 DCs serve less volume), roughly 2-10 times per day. It's not always the same nodes, though lately 2 hiccup more frequently than the others.

      Tried:

      • Upgrading Cassandra from 3.9 to 3.10
      • Increasing heap size from 8GB to 14GB
      • Upgrading JVM 1.8 from u112 to u121
      • Switching from CMS to G1GC

      The last change, switching from CMS to G1GC, reduced the frequency of the problem but has not eliminated it.

      Problem 2: Read volume in DC depressed

      With our RF set to 2 per DC, if 2 or more nodes in a DC enter the GC loop described in Problem 1, all nodes in that DC accumulate very high active/pending/blocked counts on their transport task queues.

      Total read volume for the DC drops to 10%-30% of normal.
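
      A backlog like this shows up in nodetool tpstats. A rough Python sketch of a check that flags it (it assumes the pool that backs up is Native-Transport-Requests and uses the 3.x tpstats column order; the alert threshold is illustrative):

      import subprocess
      import time

      POOL = 'Native-Transport-Requests'   # assumption: the transport pool that backs up
      PENDING_THRESHOLD = 1000             # illustrative

      def transport_backlog():
          """Return (pending, blocked) for the transport pool from 'nodetool tpstats'."""
          out = subprocess.check_output(['nodetool', 'tpstats']).decode()
          for line in out.splitlines():
              fields = line.split()
              # Data rows: Pool Name, Active, Pending, Completed, Blocked, All time blocked
              if fields and fields[0] == POOL:
                  return int(fields[2]), int(fields[4])
          return 0, 0

      if __name__ == '__main__':
          while True:
              pending, blocked = transport_backlog()
              if pending > PENDING_THRESHOLD or blocked > 0:
                  print("transport backlog: pending=%d blocked=%d" % (pending, blocked))
              time.sleep(10)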

      Tried:

      • Increasing -Dcassandra.max_queued_native_transport_requests to 4096
      • Reducing read_request_timeout_in_ms to 400
      • Reducing write_request_timeout_in_ms to 800

      These changes improved things slightly, but not by much.

          People

            Assignee: Unassigned
            Reporter: Mina Naguib (minaguib)
            Adam
            Votes: 2
            Watchers: 12
