Cassandra / CASSANDRA-13365

Nodes entering GC loop, does not recover


Details

    • Type: Bug
    • Status: Awaiting Feedback
    • Priority: Normal
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: Legacy/Core
    • Labels: None
    • Environment: 34-node cluster over 4 DCs
      Linux CentOS 7.2 x86
      Mix of 64GB/128GB RAM / node
      Mix of 32/40 hardware threads / node, Xeon ~2.4GHz
      High read volume, low write volume, occasional sstable bulk loading

    • Severity: Normal

    Description

      Over the last week we've been observing two related problems affecting our Cassandra cluster.

      Problem 1: One to a few nodes per DC enter a GC loop and do not recover

      Checking the heap usage stats, we see a sudden jump of 1-3GB. Some nodes recover, but some don't and log this:

      2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)  13G->11G(14G), 29.4127307 secs]
      2017-03-21T11:23:45.270-0400: 54141.833: [Full GC (Allocation Failure)  13G->12G(14G), 28.1561881 secs]
      2017-03-21T11:24:20.307-0400: 54176.869: [Full GC (Allocation Failure)  13G->13G(14G), 27.7019501 secs]
      2017-03-21T11:24:50.528-0400: 54207.090: [Full GC (Allocation Failure)  13G->13G(14G), 27.1372267 secs]
      2017-03-21T11:25:19.190-0400: 54235.752: [Full GC (Allocation Failure)  13G->13G(14G), 27.0703975 secs]
      2017-03-21T11:25:46.711-0400: 54263.273: [Full GC (Allocation Failure)  13G->13G(14G), 27.3187768 secs]
      2017-03-21T11:26:15.419-0400: 54291.981: [Full GC (Allocation Failure)  13G->13G(14G), 26.9493405 secs]
      2017-03-21T11:26:43.399-0400: 54319.961: [Full GC (Allocation Failure)  13G->13G(14G), 27.5222085 secs]
      2017-03-21T11:27:11.383-0400: 54347.945: [Full GC (Allocation Failure)  13G->13G(14G), 27.1769581 secs]
      2017-03-21T11:27:40.174-0400: 54376.737: [Full GC (Allocation Failure)  13G->13G(14G), 27.4639031 secs]
      2017-03-21T11:28:08.946-0400: 54405.508: [Full GC (Allocation Failure)  13G->13G(14G), 30.3480523 secs]
      2017-03-21T11:28:40.117-0400: 54436.680: [Full GC (Allocation Failure)  13G->13G(14G), 27.8220513 secs]
      2017-03-21T11:29:08.459-0400: 54465.022: [Full GC (Allocation Failure)  13G->13G(14G), 27.4691271 secs]
      2017-03-21T11:29:37.114-0400: 54493.676: [Full GC (Allocation Failure)  13G->13G(14G), 27.0275733 secs]
      2017-03-21T11:30:04.635-0400: 54521.198: [Full GC (Allocation Failure)  13G->13G(14G), 27.1902627 secs]
      2017-03-21T11:30:32.114-0400: 54548.676: [Full GC (Allocation Failure)  13G->13G(14G), 27.8872850 secs]
      2017-03-21T11:31:01.430-0400: 54577.993: [Full GC (Allocation Failure)  13G->13G(14G), 27.1609706 secs]
      2017-03-21T11:31:29.024-0400: 54605.587: [Full GC (Allocation Failure)  13G->13G(14G), 27.3635138 secs]
      2017-03-21T11:31:57.303-0400: 54633.865: [Full GC (Allocation Failure)  13G->13G(14G), 27.4143510 secs]
      2017-03-21T11:32:25.110-0400: 54661.672: [Full GC (Allocation Failure)  13G->13G(14G), 27.8595986 secs]
      2017-03-21T11:32:53.922-0400: 54690.485: [Full GC (Allocation Failure)  13G->13G(14G), 27.5242543 secs]
      2017-03-21T11:33:21.867-0400: 54718.429: [Full GC (Allocation Failure)  13G->13G(14G), 30.8930130 secs]
      2017-03-21T11:33:53.712-0400: 54750.275: [Full GC (Allocation Failure)  13G->13G(14G), 27.6523013 secs]
      2017-03-21T11:34:21.760-0400: 54778.322: [Full GC (Allocation Failure)  13G->13G(14G), 27.3030198 secs]
      2017-03-21T11:34:50.073-0400: 54806.635: [Full GC (Allocation Failure)  13G->13G(14G), 27.1594154 secs]
      2017-03-21T11:35:17.743-0400: 54834.306: [Full GC (Allocation Failure)  13G->13G(14G), 27.3766949 secs]
      2017-03-21T11:35:45.797-0400: 54862.360: [Full GC (Allocation Failure)  13G->13G(14G), 27.5756770 secs]
      2017-03-21T11:36:13.816-0400: 54890.378: [Full GC (Allocation Failure)  13G->13G(14G), 27.5541813 secs]
      2017-03-21T11:36:41.926-0400: 54918.488: [Full GC (Allocation Failure)  13G->13G(14G), 33.7510103 secs]
      2017-03-21T11:37:16.132-0400: 54952.695: [Full GC (Allocation Failure)  13G->13G(14G), 27.4856611 secs]
      2017-03-21T11:37:44.454-0400: 54981.017: [Full GC (Allocation Failure)  13G->13G(14G), 28.1269335 secs]
      2017-03-21T11:38:12.774-0400: 55009.337: [Full GC (Allocation Failure)  13G->13G(14G), 27.7830448 secs]
      2017-03-21T11:38:40.840-0400: 55037.402: [Full GC (Allocation Failure)  13G->13G(14G), 27.3527326 secs]
      2017-03-21T11:39:08.610-0400: 55065.173: [Full GC (Allocation Failure)  13G->13G(14G), 27.5828941 secs]
      2017-03-21T11:39:36.833-0400: 55093.396: [Full GC (Allocation Failure)  13G->13G(14G), 27.9303030 secs]
      2017-03-21T11:40:05.265-0400: 55121.828: [Full GC (Allocation Failure)  13G->13G(14G), 36.9902867 secs]
      2017-03-21T11:40:42.400-0400: 55158.963: [Full GC (Allocation Failure)  13G->13G(14G), 27.6835744 secs]
      2017-03-21T11:41:10.529-0400: 55187.091: [Full GC (Allocation Failure)  13G->13G(14G), 27.1899555 secs]
      2017-03-21T11:41:38.018-0400: 55214.581: [Full GC (Allocation Failure)  13G->13G(14G), 27.7309706 secs]
      2017-03-21T11:42:06.062-0400: 55242.624: [Full GC (Allocation Failure)  13G->13G(14G), 29.2103234 secs]
      2017-03-21T11:42:35.579-0400: 55272.141: [Full GC (Allocation Failure)  13G->13G(14G), 27.9902741 secs]
      2017-03-21T11:43:03.955-0400: 55300.518: [Full GC (Allocation Failure)  13G->13G(14G), 28.0764545 secs]
      2017-03-21T11:43:32.186-0400: 55328.748: [Full GC (Allocation Failure)  13G->13G(14G), 28.3913296 secs]
      2017-03-21T11:44:00.810-0400: 55357.373: [Full GC (Allocation Failure)  13G->13G(14G), 27.0945131 secs]
      2017-03-21T11:44:28.343-0400: 55384.905: [Full GC (Allocation Failure)  13G->13G(14G), 27.3357763 secs]
      2017-03-21T11:44:55.909-0400: 55412.471: [Full GC (Allocation Failure)  13G->13G(14G), 27.9216626 secs]
      2017-03-21T11:45:24.070-0400: 55440.632: [Full GC (Allocation Failure)  13G->13G(14G), 29.8189953 secs]
      2017-03-21T11:45:54.092-0400: 55470.654: [Full GC (Allocation Failure)  13G->13G(14G), 27.9992085 secs]
      2017-03-21T11:46:22.219-0400: 55498.782: [Full GC (Allocation Failure)  13G->13G(14G), 27.4622100 secs]
      2017-03-21T11:46:49.983-0400: 55526.545: [Full GC (Allocation Failure)  13G->13G(14G), 27.9159143 secs]
      2017-03-21T11:47:18.029-0400: 55554.592: [Full GC (Allocation Failure)  13G->13G(14G), 28.5207919 secs]
      2017-03-21T11:47:46.938-0400: 55583.500: [Full GC (Allocation Failure)  13G->13G(14G), 27.0355291 secs]
      2017-03-21T11:48:14.235-0400: 55610.798: [Full GC (Allocation Failure)  13G->13G(14G), 35.6284865 secs]
      2017-03-21T11:48:50.013-0400: 55646.575: [Full GC (Allocation Failure)  13G->13G(14G), 27.3299061 secs]
      2017-03-21T11:49:17.592-0400: 55674.154: [Full GC (Allocation Failure)  13G->13G(14G), 27.2544348 secs]
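
      A loop like the one above can be flagged mechanically from the GC log. A rough Python sketch of such a check (the regex matches the Full GC lines shown above; the streak length and "unproductive GC" slack are illustrative thresholds):

      import re
      import sys

      # Matches lines like:
      # 2017-03-21T11:23:02.957-0400: 54099.519: [Full GC (Allocation Failure)  13G->11G(14G), 29.4127307 secs]
      FULL_GC = re.compile(r'\[Full GC \(Allocation Failure\)\s+\d+G->(\d+)G\((\d+)G\), ([\d.]+) secs\]')

      def detect_gc_loop(path, min_consecutive=5, slack_gb=1):
          """Warn when several consecutive Full GCs leave the heap within slack_gb of capacity."""
          streak = 0
          for line in open(path):
              m = FULL_GC.search(line)
              if not m:
                  continue
              after_gb, capacity_gb, pause_s = int(m.group(1)), int(m.group(2)), float(m.group(3))
              if capacity_gb - after_gb <= slack_gb:   # GC reclaimed almost nothing
                  streak += 1
                  if streak == min_consecutive:
                      print("GC loop: %d unproductive Full GCs in a row, last pause %.1fs" % (streak, pause_s))
              else:
                  streak = 0

      if __name__ == '__main__':
          detect_gc_loop(sys.argv[1])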
      

      When a node enters the above state, it becomes useless within 1-5 minutes:

      • Client connections die
      • New connections time out
      • CPU usage drops - all threads are idle except for one thread near 100% CPU ("VM Thread" os_prio=0 tid=0x00007ffaf8209bf0 nid=0x2d53 runnable), doing the above GC in a loop
      • JMX becomes unqueryable
      • A regular kill signal is ignored (needs kill -9)

      I've managed to extract a full heap dump of a node while in this state, but it's 20GB and I have not been able to find a tool that can analyze it.

      I've extracted a histogram of heap objects. From a healthy node:

       num     #instances         #bytes  class name
      ----------------------------------------------
         1:      12688007     1116544616  io.netty.buffer.PooledUnsafeDirectByteBuf
         2:      24938940      798046080  io.netty.util.Recycler$DefaultHandle
         3:      11457548      733283072  io.netty.channel.ChannelOutboundBuffer$Entry
         4:       9582138      459942624  java.nio.HeapByteBuffer
         5:         16974      393432416  [I
         6:       2014719      311356992  [B
         7:       4982167      181324848  [Ljava.lang.Object;
         8:          8531      118782704  [Lio.netty.util.Recycler$DefaultHandle;
         9:       2827662      113106480  org.apache.cassandra.db.rows.BufferCell
        10:       2737816       87610112  org.apache.cassandra.db.rows.BTreeRow
        11:       2740110       70228240  [Ljava.nio.ByteBuffer;
        12:       2738081       65713944  org.apache.cassandra.db.BufferClustering
        13:       1763956       56446592  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
        14:       1763736       56439552  com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
        15:       2205825       52939800  com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
        16:       1617026       51744832  java.util.concurrent.ConcurrentHashMap$Node
        17:       1769934       42478416  org.apache.cassandra.cache.KeyCacheKey
      

      From an unhealthy node:

       num     #instances         #bytes  class name
      ----------------------------------------------
         1:     103726026     4978849248  java.nio.HeapByteBuffer
         2:      95646048     2832233872  [B
         3:      28076465     1207916952  [Ljava.lang.Object;
         4:      12803448     1126703424  io.netty.buffer.PooledUnsafeDirectByteBuf
         5:      25054634      801748288  io.netty.util.Recycler$DefaultHandle
         6:      11457548      733283072  io.netty.channel.ChannelOutboundBuffer$Entry
         7:      24725380      593409120  java.util.ArrayList
         8:       5808290      185865280  java.util.RegularEnumSet
         9:       2870889      160769784  org.apache.cassandra.transport.messages.ResultMessage$Rows
        10:       2937336      140992128  io.netty.buffer.SlicedAbstractByteBuf
        11:          8854      118773984  [Lio.netty.util.Recycler$DefaultHandle;
        12:       2830805      113232200  org.apache.cassandra.db.rows.BufferCell
        13:       2937336       93994752  org.apache.cassandra.transport.Frame$Header
        14:       2870928       91869696  org.apache.cassandra.cql3.ResultSet$ResultMetadata
        15:       2728627       87316064  org.apache.cassandra.db.rows.BTreeRow
        16:       2947033       70728792  java.util.concurrent.ConcurrentLinkedQueue$Node
        17:       2937336       70496064  org.apache.cassandra.transport.Frame
      

      The increase is clear in these top 3 object types:

         1:     103726026     4978849248  java.nio.HeapByteBuffer
         2:      95646048     2832233872  [B
         3:      28076465     1207916952  [Ljava.lang.Object;
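
      The comparison above falls out of simply diffing the two histograms by class name. A rough Python sketch, assuming each jmap-style histogram above was saved to a file with the column layout shown:

      import re
      import sys

      # A histogram row looks like:
      #    1:     103726026     4978849248  java.nio.HeapByteBuffer
      ROW = re.compile(r'^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)')

      def load(path):
          """Map class name -> total bytes from a saved histogram file."""
          sizes = {}
          for line in open(path):
              m = ROW.match(line)
              if m:
                  sizes[m.group(3)] = int(m.group(2))
          return sizes

      def top_growth(healthy_path, unhealthy_path, top=10):
          """Print the classes with the largest byte growth between the two histograms."""
          healthy, unhealthy = load(healthy_path), load(unhealthy_path)
          deltas = sorted(((unhealthy[k] - healthy.get(k, 0), k) for k in unhealthy), reverse=True)
          for delta_bytes, klass in deltas[:top]:
              print("%10d MB  %s" % (delta_bytes // (1024 * 1024), klass))

      if __name__ == '__main__':
          top_growth(sys.argv[1], sys.argv[2])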
      

      Other related observations:

      Client socket states: essentially the JVM has stopped working; the recv-queue is piling up kernel-side but is not being consumed app-side. Worth noting that the peers (clients) have given up and closed the connections on their end, but they still appear ESTABlished on Cassandra's side, since it hasn't drained the read buffers to EOF/error or completed the close:

      ESTAB      558612 0      172.30.72.133:9042               172.30.72.145:11418
      ESTAB      564172 0      172.30.72.133:9042               172.30.72.145:12420
      ESTAB      545284 0      172.30.72.133:9042               172.30.72.145:18212
      ESTAB      565468 0      172.30.72.133:9042               172.30.72.145:18346
      ESTAB      525884 0      172.30.72.133:9042               172.30.72.145:18352
      ESTAB      532324 0      172.30.72.133:9042               172.30.72.145:28896
      ESTAB      517592 0      172.30.72.133:9042               172.30.72.145:34504
      ESTAB      553392 0      172.30.72.133:9042               172.30.72.145:36926
      ESTAB      581952 0      172.30.72.133:9042               172.30.72.145:40684
      ESTAB      543668 0      172.30.72.133:9042               172.30.72.145:42282
      ESTAB      511860 0      172.30.72.133:9042               172.30.72.145:47350
      ESTAB      590380 0      172.30.72.133:9042               172.30.72.145:47566
      ESTAB      583120 0      172.30.72.133:9042               172.30.72.145:50386
      ESTAB      536772 0      172.30.72.133:9042               172.30.72.145:52656
      ESTAB      31     0      172.30.72.133:9042               172.30.72.145:52946
      ESTAB      532140 0      172.30.72.133:9042               172.30.72.145:53618
      ESTAB      14397  0      172.30.72.133:9042               172.30.72.150:10016
      ESTAB      6444   0      172.30.72.133:9042               172.30.72.150:12176
      ESTAB      7568   0      172.30.72.133:9042               172.30.72.150:12934
      ESTAB      139479 0      172.30.72.133:9042               172.30.72.150:13186
      ESTAB      6896   0      172.30.72.133:9042               172.30.72.150:17316
      ESTAB      7808   0      172.30.72.133:9042               172.30.72.150:17448
      ESTAB      8712   0      172.30.72.133:9042               172.30.72.150:17748
      ESTAB      6944   0      172.30.72.133:9042               172.30.72.150:18472
      ESTAB      136700 0      172.30.72.133:9042               172.30.72.150:19860
      ESTAB      16260  0      172.30.72.133:9042               172.30.72.150:20696
      ESTAB      8943   0      172.30.72.133:9042               172.30.72.150:21208
      ESTAB      132877 0      172.30.72.133:9042               172.30.72.150:21420
      ESTAB      6936   0      172.30.72.133:9042               172.30.72.150:25440
      ESTAB      92     0      172.30.72.133:9042               172.30.72.150:25778
      ESTAB      137312 0      172.30.72.133:9042               172.30.72.150:28258
      ESTAB      40     0      172.30.72.133:9042               172.30.72.150:28348
      ESTAB      6940   0      172.30.72.133:9042               172.30.72.150:29718
      ESTAB      6416   0      172.30.72.133:9042               172.30.72.150:29858
      ESTAB      6580   0      172.30.72.133:9042               172.30.72.150:30518
      ESTAB      4202   0      172.30.72.133:9042               172.30.72.150:30948
      ESTAB      6060   0      172.30.72.133:9042               172.30.72.150:35276
      ESTAB      0      0      172.30.72.133:9042               172.30.72.150:40114
      ESTAB      134303 0      172.30.72.133:9042               172.30.72.150:42212
      ESTAB      396    0      172.30.72.133:9042               172.30.72.150:44848
      ESTAB      136450 0      172.30.72.133:9042               172.30.72.150:47742
      ESTAB      8318   0      172.30.72.133:9042               172.30.72.150:51054
      ESTAB      11976  0      172.30.72.133:9042               172.30.72.150:54560
      ESTAB      136193 0      172.30.72.133:9042               172.30.72.150:56094
      ESTAB      138737 0      172.30.72.133:9042               172.30.72.150:60382
      ESTAB      7996   0      172.30.72.133:9042               172.30.72.150:60932
      ESTAB      137235 0      172.30.72.133:9042               172.30.72.150:61568
      ESTAB      8184   0      172.30.72.133:9042               172.30.72.150:63054
      ESTAB      985276 0      172.30.72.133:9042               172.30.72.175:23494
      ESTAB      601528 0      172.30.72.133:9042               172.30.72.175:24140
      ESTAB      557844 0      172.30.72.133:9042               172.30.72.175:42248
      ESTAB      469344 0      172.30.72.133:9042               172.30.72.175:59704
      ESTAB      507160 0      172.30.72.133:9042               172.30.72.185:11958
      ESTAB      449196 0      172.30.72.133:9042               172.30.72.185:15458
      ESTAB      487796 0      172.30.72.133:9042               172.30.72.185:16458
      ESTAB      441132 0      172.30.72.133:9042               172.30.72.185:23194
      ESTAB      486664 0      172.30.72.133:9042               172.30.72.185:25276
      ESTAB      554780 0      172.30.72.133:9042               172.30.72.185:29102
      ESTAB      496248 0      172.30.72.133:9042               172.30.72.185:29428
      ESTAB      420048 0      172.30.72.133:9042               172.30.72.185:39650
      ESTAB      498704 0      172.30.72.133:9042               172.30.72.185:46424
      ESTAB      506488 0      172.30.72.133:9042               172.30.72.185:49722
      ESTAB      509416 0      172.30.72.133:9042               172.30.72.185:55838
      ESTAB      511956 0      172.30.72.133:9042               172.30.72.185:57670
      ESTAB      496720 0      172.30.72.133:9042               172.30.72.185:59884
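
      To put a number on that backlog, the kernel-side Recv-Q can be summed across the established native-port connections. A rough Python sketch driving ss (it assumes the default column layout shown above):

      import subprocess

      def native_port_backlog(port=9042):
          """Sum the kernel-side Recv-Q of established client connections on the native port."""
          out = subprocess.check_output(['ss', '-tn', 'sport', '=', ':%d' % port]).decode()
          conns, backlog = 0, 0
          for line in out.splitlines():
              fields = line.split()
              if fields and fields[0] == 'ESTAB':   # columns: State, Recv-Q, Send-Q, Local, Peer
                  backlog += int(fields[1])
                  conns += 1
          return conns, backlog

      if __name__ == '__main__':
          conns, backlog = native_port_backlog()
          print("%d established connections, %.1f MB unread in kernel receive buffers" % (conns, backlog / 1048576.0))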
      

      perf top during this state:

        19.73%  libjvm.so                                              [.] InstanceKlass::oop_follow_contents
        14.05%  libjvm.so                                              [.] ObjArrayKlass::oop_follow_contents
         7.83%  libjvm.so                                              [.] MarkSweep::follow_stack
         6.38%  libjvm.so                                              [.] HeapRegion::block_size
         5.41%  libjvm.so                                              [.] OtherRegionsTable::add_reference
         5.12%  libjvm.so                                              [.] UpdateRSOopClosure::do_oop
         4.77%  libjvm.so                                              [.] HeapRegion::block_is_obj
      

      Frequency: this affects nodes in our 2 busier DCs (the other 2 DCs serve less volume), roughly 2-10 times per day. It's not always the same nodes, though lately 2 hiccup more frequently than the others.

      Tried:

      • Upgrading Cassandra from 3.9 to 3.10
      • Increasing heap size from 8GB to 14GB
      • Upgrading JVM 1.8 from u112 to u121
      • Switching from CMS to G1GC

      The last change, switching from CMS to G1GC, reduced the frequency of the problem but has not eliminated it.

      Problem 2: Read volume in DC depressed

      With our RF set to 2 per DC, if 2 or more nodes in a DC enter the GC loop described in Problem 1, all nodes in that DC accumulate very high active/pending/blocked counts on their transport task queues.

      Total read volume for the DC drops to 10%-30% of normal.
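
      A backlog like this shows up in nodetool tpstats. A rough Python sketch of a check that flags it (it assumes the pool that backs up is Native-Transport-Requests and uses the 3.x tpstats column order; the alert threshold is illustrative):

      import subprocess
      import time

      POOL = 'Native-Transport-Requests'   # assumption: the transport pool that backs up
      PENDING_THRESHOLD = 1000             # illustrative

      def transport_backlog():
          """Return (pending, blocked) for the transport pool from 'nodetool tpstats'."""
          out = subprocess.check_output(['nodetool', 'tpstats']).decode()
          for line in out.splitlines():
              fields = line.split()
              # Data rows: Pool Name, Active, Pending, Completed, Blocked, All time blocked
              if fields and fields[0] == POOL:
                  return int(fields[2]), int(fields[4])
          return 0, 0

      if __name__ == '__main__':
          while True:
              pending, blocked = transport_backlog()
              if pending > PENDING_THRESHOLD or blocked > 0:
                  print("transport backlog: pending=%d blocked=%d" % (pending, blocked))
              time.sleep(10)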

      Tried:

      • Increasing -Dcassandra.max_queued_native_transport_requests to 4096
      • Reducing read_request_timeout_in_ms to 400
      • Reducing write_request_timeout_in_ms to 800

      These changes improved things slightly, but not by much.

          People

            Assignee: Unassigned
            Reporter: Mina Naguib (minaguib)
            Adam
            Votes: 2
            Watchers: 12
