Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-6653

Attempting to bootstrap causes nodes to lock up in GC

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Normal
    • Resolution: Not A Problem
    • None
    • None
    • None
    • VNodes using Murmur3

    • Normal

    Description

      We have been struggling with the inability to bootstrap nodes into our 1.2.13 environment with Vnodes using centos 6.4 with Java 7. We have an 8 node cluster (32 GB RAM, dual hex core, SSDs, 8 GB heap with 1200 MB eden space, RF3) with around 1 TB per node using murmur3. When we go to bootstrap a new node this is what we see:

      • Bootstrapping node assigns tokens and requests data from cluster
      • 5-6 nodes within the cluster begin to stream data
      • Around 2 minutes after bootstrap start, between 1 and 4 nodes (sometimes the bootstrapping node and sometimes not) become unresponsive in par new GCs
      • IF no nodes go down during the first 5 minutes of bootstrap, then the bootstrap will succeed without issue
        GC mired nodes tend to recover after a minute or two but the receiving node stops attempting to get more data from the nodes
      • Bootstrap eventually fails (after streaming all the data from nodes that did not go down) with Unable to Fetch Ranges

      We have tried the following and it appears that sometimes a bootstrap will succeed (perhaps 1 in 10) but with no discernible pattern:

      • Increase phi_convict to 16
      • Restart all nodes prior to bootstrap (to ensure heap is as “clean” as possible)
      • Stop production load against the cluster (to reduce par new churn); after 5 minutes we know if the bootstrap will succeed so we then re-enable load
      • Distribute soft interrupts across all CPUs

      Below is an output from the GC log of the bootstrapping node when it was stuck in GC.

      {Heap before GC invocations=109 (full 0):
      par new generation total 1105920K, used 1021140K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
      eden space 983040K, 100% used [0x00000005fae00000, 0x0000000636e00000, 0x0000000636e00000)
      from space 122880K, 31% used [0x000000063e600000, 0x0000000640b350f0, 0x0000000645e00000)
      to space 122880K, 0% used [0x0000000636e00000, 0x0000000636e00000, 0x000000063e600000)
      concurrent mark-sweep generation total 7159808K, used 3826815K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
      concurrent-mark-sweep perm gen total 24512K, used 24368K [0x00000007fae00000, 0x00000007fc5f0000, 0x0000000800000000)
      2014-02-05T13:27:49.621+0000: 210.242: [GC 210.242: [ParNew: 1021140K->122880K(1105920K), 0.2963210 secs] 4847955K->4024095K(8265728K), 0.2965270 secs] [Times: user=4.97 sys=0.00, real=0.30 secs]
      Heap after GC invocations=110 (full 0):
      par new generation total 1105920K, used 122880K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
      eden space 983040K, 0% used [0x00000005fae00000, 0x00000005fae00000, 0x0000000636e00000)
      from space 122880K, 100% used [0x0000000636e00000, 0x000000063e600000, 0x000000063e600000)
      to space 122880K, 0% used [0x000000063e600000, 0x000000063e600000, 0x0000000645e00000)
      concurrent mark-sweep generation total 7159808K, used 3901215K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
      concurrent-mark-sweep perm gen total 24512K, used 24368K [0x00000007fae00000, 0x00000007fc5f0000, 0x0000000800000000)
      }
      Total time for which application threads were stopped: 0.2968550 seconds
      Application time: 1.5953840 seconds
      Total time for which application threads were stopped: 0.0002040 seconds
      Application time: 0.0000510 seconds

      Relevant portion of GC log from non-bootstrapping node:

      {Heap before GC invocations=518 (full 1):
      par new generation total 1105920K, used 17921K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
      eden space 983040K, 1% used [0x00000005fae00000, 0x00000005fbf29db8, 0x0000000636e00000)
      from space 122880K, 0% used [0x0000000636e00000, 0x0000000636e56938, 0x000000063e600000)
      to space 122880K, 0% used [0x000000063e600000, 0x000000063e600000, 0x0000000645e00000)
      concurrent mark-sweep generation total 7159808K, used 6367511K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
      concurrent-mark-sweep perm gen total 29888K, used 29784K [0x00000007fae00000, 0x00000007fcb30000, 0x0000000800000000)
      2014-02-04T16:16:44.471+0000: 945.646: [GC 945.646: [ParNew: 17921K->364K(1105920K), 0.0090810 secs]945.655: [CMS2014-02-04T16:16:48.373+0000: 949.548: [CMS-concurrent-sweep: 3.938/4.362 secs] [Times: user=9.10 sys=0.19, real=4.36 secs]
      (concurrent mode failure): 6367540K->4453666K(7159808K), 16.4971830 secs] 6385433K->4453666K(8265728K), [CMS Perm : 29784K->29740K(29888K)], 16.5083610 secs] [Times: user=16.61 sys=0.00, real=16.50 secs]
      Heap after GC invocations=519 (full 2):
      par new generation total 1105920K, used 0K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
      eden space 983040K, 0% used [0x00000005fae00000, 0x00000005fae00000, 0x0000000636e00000)
      from space 122880K, 0% used [0x000000063e600000, 0x000000063e600000, 0x0000000645e00000)
      to space 122880K, 0% used [0x0000000636e00000, 0x0000000636e00000, 0x000000063e600000)
      concurrent mark-sweep generation total 7159808K, used 4453666K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
      concurrent-mark-sweep perm gen total 49568K, used 29740K [0x00000007fae00000, 0x00000007fde68000, 0x0000000800000000)
      }
      Total time for which application threads were stopped: 16.5111290 seconds
      Application time: 0.0094340 seconds
      Total time for which application threads were stopped: 0.2300320 seconds
      Application time: 0.0000550 seconds
      2014-02-04T16:17:01.221+0000: 962.396: [GC [1 CMS-initial-mark: 5584142K(7159808K)] 5614460K(8265728K), 0.0240930 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
      Total time for which application threads were stopped: 0.0259820 seconds
      2014-02-04T16:17:01.246+0000: 962.420: [CMS-concurrent-mark-start]
      Application time: 0.6072530 seconds
      {Heap before GC invocations=519 (full 3):
      par new generation total 1105920K, used 983040K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
      eden space 983040K, 100% used [0x00000005fae00000, 0x0000000636e00000, 0x0000000636e00000)
      from space 122880K, 0% used [0x000000063e600000, 0x000000063e600000, 0x0000000645e00000)
      to space 122880K, 0% used [0x0000000636e00000, 0x0000000636e00000, 0x000000063e600000)
      concurrent mark-sweep generation total 7159808K, used 5584142K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
      concurrent-mark-sweep perm gen total 49568K, used 29776K [0x00000007fae00000, 0x00000007fde68000, 0x0000000800000000)
      2014-02-04T16:17:01.855+0000: 963.030: [GC 963.030: [ParNew: 983040K->21724K(1105920K), 0.0249840 secs] 6567182K->5605866K(8265728K), 0.0254720 secs] [Times: user=0.42 sys=0.00, real=0.02 secs]
      Heap after GC invocations=520 (full 3):
      par new generation total 1105920K, used 21724K [0x00000005fae00000, 0x0000000645e00000, 0x0000000645e00000)
      eden space 983040K, 0% used [0x00000005fae00000, 0x00000005fae00000, 0x0000000636e00000)
      from space 122880K, 17% used [0x0000000636e00000, 0x0000000638337290, 0x000000063e600000)
      to space 122880K, 0% used [0x000000063e600000, 0x000000063e600000, 0x0000000645e00000)
      concurrent mark-sweep generation total 7159808K, used 5584142K [0x0000000645e00000, 0x00000007fae00000, 0x00000007fae00000)
      concurrent-mark-sweep perm gen total 49568K, used 29776K [0x00000007fae00000, 0x00000007fde68000, 0x0000000800000000)
      }
      Total time for which application threads were stopped: 0.0280980 seconds
      Application time: 0.5525030 seconds
      Total time for which application threads were stopped: 0.0547220 seconds
      Application time: 0.0009410 seconds
      {Heap before GC invocations=520 (full 3):

      Java flags:

      /usr/java/default/bin/java -ea -javaagent:/usr/local/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8G -Xmx8G -Xmn1200M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedOops -XX:+UseTLAB -XX:+UseCondCardMark -XX:-UseBiasedLocking -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/var/log/cassandra/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.util.logging.config.file=/usr/local/cassandra/conf/logging.properties -javaagent:/usr/local/cassandra/lib/jmxetric-1.0.4.jar=host=lxpcas001.nanigans.com,port=8649,wireformat31x=true,config=/usr/local/cassandra/conf/jmxetric.xml -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -Dcassandra-foreground=yes -cp /usr/local/cassandra/conf:/usr/local/cassandra/build/classes/main:/usr/local/cassandra/build/classes/thrift:/usr/local/cassandra/lib/antlr-3.2.jar:/usr/local/cassandra/lib/apache-cassandra-1.2.12.jar:/usr/local/cassandra/lib/apache-cassandra-clientutil-1.2.12.jar:/usr/local/cassandra/lib/apache-cassandra-thrift-1.2.12.jar:/usr/local/cassandra/lib/avro-1.4.0-fixes.jar:/usr/local/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/local/cassandra/lib/commons-cli-1.1.jar:/usr/local/cassandra/lib/commons-codec-1.2.jar:/usr/local/cassandra/lib/commons-lang-2.6.jar:/usr/local/cassandra/lib/compress-lzf-0.8.4.jar:/usr/local/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/local/cassandra/lib/gmetric4j-1.0.3.jar:/usr/local/cassandra/lib/guava-13.0.1.jar:/usr/local/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/local/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/local/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/local/cassandra/lib/jamm-0.2.5.jar:/usr/local/cassandra/lib/jbcrypt-0.3m.jar:/usr/local/cassandra/lib/jline-1.0.jar:/usr/local/cassandra/lib/jmxetric-1.0.4.jar:/usr/local/cassandra/lib/jna.jar:/usr/local/cassandra/lib/json-simple-1.1.jar:/usr/local/cassandra/lib/libthrift-0.7.0.jar:/usr/local/cassandra/lib/log4j-1.2.16.jar:/usr/local/cassandra/lib/lz4-1.1.0.jar:/usr/local/cassandra/lib/metrics-core-2.2.0.jar:/usr/local/cassandra/lib/netty-3.6.6.Final.jar:/usr/local/cassandra/lib/oncrpc-1.0.7.jar:/usr/local/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/local/cassandra/lib/slf4j-api-1.7.2.jar:/usr/local/cassandra/lib/slf4j-log4j12-1.7.2.jar:/usr/local/cassandra/lib/snakeyaml-1.6.jar:/usr/local/cassandra/lib/snappy-java-1.0.5.jar:/usr/local/cassandra/lib/snaptree-0.1.jar org.apache.cassandra.service.CassandraDaemon

      Attachments

        Activity

          People

            Unassigned Unassigned
            keithwrightbos Keith Wright
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: