Cassandra / CASSANDRA-8447

Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled


Details

    • Bug
    • Status: Resolved
    • Normal
    • Resolution: Duplicate
    • 2.0.12
    • None
    • None
    • Normal

    Description

      Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC.

      Test methodology - Disabled autocompaction on 3 nodes and left it enabled on 1 node. Ran several Cassandra stress loads using write-only operations. Monitored heap pressure with visualvm and jconsole. Captured iostat and dstat output for most tests. Captured a heap dump from the 50 thread load. Hints were disabled on all nodes during testing to reduce GC noise from hints backing up.

      Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=1900000000 -rate threads=<different threads tested> -schema "replication(factor=3)" keyspace="Keyspace1" -node <all nodes listed>

      Data load thread count and results:

      • 1 thread - Still running but looks like the node can sustain this load (approx 500 writes per second per node)
      • 5 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 2k writes per second per node)
      • 10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range
      • 50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 10k writes per second per node)
      • 100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 20k writes per second per node)
      • 200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 25k writes per second per node)

      Note - the observed behavior was the same for all tests except for the single threaded test. The single threaded test does not appear to show this behavior.
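
The thread-count sweep above can be sketched as a small dry-run script that prints one cassandra-stress command per tested thread count. This is an illustration only: the `NODES` value and the `stress_cmd` helper are hypothetical (the report does not list the node addresses), and the schema option is quoted so the shell does not interpret the parentheses.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the stress command for each thread count.
# NODES is a hypothetical placeholder; the report elides the real node list.
NODES="${NODES:-node1,node2,node3,node4}"

stress_cmd() {
    # $1 = client thread count
    printf '%s\n' "/usr/bin/cassandra-stress write n=1900000000 -rate threads=$1 -schema 'replication(factor=3)' keyspace=\"Keyspace1\" -node $NODES"
}

# Thread counts taken from the results list in the report.
for threads in 1 5 10 50 100 200; do
    stress_cmd "$threads"
done
```

To run a load for real, the printed line for the chosen thread count would be executed against the actual node list.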

      Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads.

      JVM settings tested:

      1. default, out-of-the-box cassandra-env.sh settings
      2. 10 G Max | 1 G New - default cassandra-env.sh settings
      3. 10 G Max | 1 G New - default cassandra-env.sh settings, plus:
        • JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
      4. 20 G Max | 10 G New
        JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
        JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
        JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
        JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
        JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
        JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
        JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
        JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
        JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
        JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
        JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
        JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
        JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
        JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
        JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
        JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
        JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
      5. 20 G Max | 1 G New
        JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
        JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
        JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
        JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
        JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
        JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
        JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
        JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
        JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
        JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
        JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
        JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
        JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
        JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
        JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
        JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
        JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
        JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
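
For comparing the scenarios above, a rough sketch of where CMS would start a concurrent cycle in each one. It assumes old gen is approximately max heap minus new gen, that the heap is fully committed, and that the occupancy fraction acts as a fixed threshold (which is the case when -XX:+UseCMSInitiatingOccupancyOnly is set). The helper name `cms_trigger_mb` is hypothetical.

```shell
#!/bin/sh
# Approximate old gen occupancy (in MB) at which CMS initiates a cycle.
# Assumption: old gen = max heap - new gen, with G treated as 1024 MB.
cms_trigger_mb() {
    # $1 = max heap (G), $2 = new gen (G), $3 = CMSInitiatingOccupancyFraction (%)
    old_mb=$(( ($1 - $2) * 1024 ))
    echo $(( old_mb * $3 / 100 ))
}

echo "scenario 3 (10G max, 1G new, 50%):  $(cms_trigger_mb 10 1 50) MB"
echo "scenario 4 (20G max, 10G new, 75%): $(cms_trigger_mb 20 10 75) MB"
echo "scenario 5 (20G max, 1G new, 75%):  $(cms_trigger_mb 20 1 75) MB"
```

Under these assumptions the threshold ranges from about 4.5 G (scenario 3) to about 14.25 G (scenario 5), so the scenarios differ widely in how much old gen headroom remains when a CMS cycle begins.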

      Linux OS settings tested:

      1. Disabled Transparent Huge Pages
        echo never > /sys/kernel/mm/transparent_hugepage/enabled
        echo never > /sys/kernel/mm/transparent_hugepage/defrag
      2. Enabled Huge Pages
        echo 21500000000 > /proc/sys/kernel/shmmax (over 20GB for heap)
        echo 1536 > /proc/sys/vm/nr_hugepages (20GB/2MB page size)
      3. Disabled NUMA
        numa=off in /etc/grub.conf
      4. Verified all settings documented here were implemented
        http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
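
As a cross-check on the huge page settings above, a small arithmetic sketch. It assumes 2 MB huge pages (as stated in the report) and treats 1 GB as 1024 MB; the helper names are hypothetical.

```shell
#!/bin/sh
# Number of huge pages needed to back a heap of a given size.
hugepages_for() {
    # $1 = heap size in GB, $2 = huge page size in MB
    echo $(( $1 * 1024 / $2 ))
}

# Memory (in MB) covered by a given number of huge pages.
hugepages_covered_mb() {
    # $1 = number of pages, $2 = huge page size in MB
    echo $(( $1 * $2 ))
}

echo "pages needed for a 20 GB heap: $(hugepages_for 20 2)"
echo "memory covered by 1536 pages:  $(hugepages_covered_mb 1536 2) MB"
```

By this arithmetic a 20 GB heap needs 10240 pages of 2 MB, while 1536 pages cover 3072 MB. Whether the nr_hugepages value used in the test reserved enough pages for the whole heap is therefore worth verifying; the report itself does not say.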

      Attachments:

      1. cassandra.yaml
      2. fio output - results.tar.gz
      3. 50 thread heap dump - https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
      4. 100 thread - visual vm anonymous screenshot - visualvm_screenshot
      5. dstat screenshot with compaction - Node_with_compaction.png
      6. dstat screenshot without compaction - Node_without_compaction.png
      7. gcinspector messages from system.log
      8. gc.log output - gc.logs.tar.gz

      Observations:

      1. even though this is a spinning disk implementation, disk I/O looks good
      2. CPU usage is moderate until large GC events occur
      3. once old gen heap fills up and cannot be cleaned, memtable post flushers start to back up (tpstats shows many pending tasks)
      4. the node itself, i.e. ssh, is still responsive but the Cassandra instance becomes unresponsive
      5. once old gen heap fills up Cassandra stress starts to throw CL ONE errors stating there aren't enough replicas to satisfy....
      6. heap dump from 50 thread, JVM scenario 1 is attached
        • appears to show a compaction thread consuming a lot of memory
      7. sample system.log output for gc issues
      8. strace -e futex -p $PID -f -c output during 100 thread load and during old gen "filling", just before full
        % time     seconds  usecs/call     calls    errors syscall
        100.00  244.886766        4992     49052      7507 futex
        100.00  244.886766                 49052      7507 total
      9. htop during full gc cycle - https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
      10. nothing is blocked via tpstats on these nodes
      11. compaction does have pending tasks, upwards of 20, on the nodes
      12. Nodes without compaction achieved approximately 20k writes per second per node without errors or drops
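
The strace summary in observation 8 can be sanity-checked with a little arithmetic. This sketch recomputes the average time per futex call and the share of calls that returned an error from the totals in the report; the helper names are hypothetical.

```shell
#!/bin/sh
# Average microseconds per call, truncated to an integer as in strace -c output.
avg_usecs() {
    # $1 = total seconds, $2 = number of calls
    awk -v s="$1" -v c="$2" 'BEGIN { printf "%d\n", s * 1000000 / c }'
}

# Percentage of calls that returned an error.
error_pct() {
    # $1 = errors, $2 = number of calls
    awk -v e="$1" -v c="$2" 'BEGIN { printf "%.1f\n", e * 100 / c }'
}

echo "avg usecs per futex call: $(avg_usecs 244.886766 49052)"
echo "futex calls with errors:  $(error_pct 7507 49052)%"
```

The recomputed average matches the usecs/call column (4992), roughly 5 ms per futex call, and about 15% of the futex calls returned an error.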

      Next Steps:

      1. Will try to create a flame graph and upload it here - http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
      2. Will try to recreate in another environment

      Attachments

        1. visualvm_screenshot - 245 kB - jonathan lacefield
        2. results.tar.gz - 22 kB - jonathan lacefield
        3. output.svg - 771 kB - jonathan lacefield
        4. output.2.svg - 1.35 MB - jonathan lacefield
        5. output.1.svg - 1.33 MB - jonathan lacefield
        6. Node_without_compaction.png - 957 kB - jonathan lacefield
        7. Node_with_compaction.png - 935 kB - jonathan lacefield
        8. memtable_debug - 1.98 MB - jonathan lacefield
        9. gcinspector_messages.txt - 17 kB - jonathan lacefield
        10. gc.logs.tar.gz - 781 kB - jonathan lacefield
        11. cassandra.yaml - 32 kB - jonathan lacefield


          People

            Assignee: Unassigned
            Reporter: jonathan lacefield (jlacefie)
            Votes: 0
            Watchers: 11
