[CASSANDRA-8447] Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Duplicate
Fix Version/s: 2.0.12
Component/s: None
Labels: None
Environment:

Cluster size - 4 nodes
Node size - 12 CPU cores (24 with hyper-threading), 192 GB RAM, 2 RAID 0 arrays (data - 10 spinning 10k drives | commit log - 2 spinning 10k drives)
OS - RHEL 6.5
JVM - Oracle 1.7.0_71
Cassandra version 2.0.11

Severity: Normal
Since Version: 2.0.11

Description

Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC.

Test methodology - disabled autocompaction on 3 nodes and left it enabled on 1 node. Ran several Cassandra stress loads using write-only operations. Monitored heap pressure with VisualVM and JConsole. Captured iostat and dstat output for most tests. Captured a heap dump during the 50 thread load. Hints were disabled on all nodes during testing to reduce GC noise from hints backing up.

Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=1900000000 -rate threads=<different threads tested> -schema replication(factor=3) keyspace="Keyspace1" -node <all nodes listed>

Data load thread count and results:

1 thread - Still running but looks like the node can sustain this load (approx 500 writes per second per node)
5 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 2k writes per second per node)
10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range
50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 10k writes per second per node)
100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 20k writes per second per node)
200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 25k writes per second per node)

Note - the observed behavior was the same for all tests except for the single threaded test. The single threaded test does not appear to show this behavior.
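
The thread-count sweep above can be sketched as a small dry-run script. It only prints the cassandra-stress commands rather than running them. The node list is a hypothetical placeholder (the report elides the actual nodes), the function names are illustrative, and the quoting around replication(factor=3) is added so the shell does not interpret the parentheses.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the thread-count sweep: prints one command per load.
# NODES is a hypothetical placeholder; the report does not list the nodes.
NODES="${NODES:-node1,node2,node3,node4}"

# Build the cassandra-stress invocation from the report for one thread count.
stress_cmd() {
  local threads="$1"
  printf '%s\n' "/usr/bin/cassandra-stress write n=1900000000 -rate threads=${threads} -schema 'replication(factor=3)' keyspace=\"Keyspace1\" -node ${NODES}"
}

# Thread counts taken from the results list in the report.
run_sweep() {
  local t
  for t in 1 5 10 50 100 200; do
    stress_cmd "$t"
  done
}

run_sweep
```

To actually execute a load instead of printing it, the output of stress_cmd could be passed to eval, one thread count at a time.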

Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads.

JVM settings tested:

1. default, out of the box, env-sh settings
2. 10 G Max | 1 G New - default env-sh settings
3. 10 G Max | 1 G New - default env-sh settings, plus:
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
4. 20 G Max | 10 G New
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
5. 20 G Max | 1 G New
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
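
For a sense of scale, the heap configurations listed above can be turned into approximate old gen sizes and CMS start thresholds. This is a rough sketch, not something from the original test runs: it assumes old gen is simply max heap minus new gen, that CMSInitiatingOccupancyFraction is applied as a percentage of old gen capacity, and that the "default env-sh settings" case uses a fraction of 75. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: old gen occupancy (in MB) at which CMS is asked to
# start. Assumes old gen = max heap - new gen, and that the fraction is a
# percentage of old gen capacity.
cms_trigger_mb() {
  local max_mb="$1" new_mb="$2" fraction="$3"
  local old_mb=$(( max_mb - new_mb ))
  echo $(( old_mb * fraction / 100 ))
}

# Heap scenarios from the report: max heap MB, new gen MB, fraction.
# The 75 on the first entry is an assumed default, not stated in the report.
for scenario in "10240 1024 75" "10240 1024 50" "20480 10240 75" "20480 1024 75"; do
  set -- $scenario
  echo "max=${1}MB new=${2}MB fraction=${3}% -> CMS starts near $(cms_trigger_mb "$1" "$2" "$3")MB old gen"
done
```

Under these assumptions, lowering the fraction from 75 to 50 on the 10 G heap moves the CMS start point from roughly 6.9 GB to 4.6 GB of old gen occupancy, which gives the collector more headroom but does not help if the occupied memory is still reachable.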

Linux OS settings tested:

Disabled Transparent Huge Pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Enabled Huge Pages
echo 21500000000 > /proc/sys/kernel/shmmax (over 20GB for heap)
echo 1536 > /proc/sys/vm/nr_hugepages (20GB/2MB page size)
Disabled NUMA
numa=off in /etc/grub.conf
Verified all settings documented here were implemented
http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
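
As a quick check on the huge page figures in the settings above, the arithmetic can be sketched as follows. It assumes a 20 GiB heap and a 2 MiB page size, as the parenthetical notes state; the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of huge pages needed to back a heap,
# rounding up to a whole page.
hugepages_needed() {
  local heap_bytes="$1" page_bytes="$2"
  echo $(( (heap_bytes + page_bytes - 1) / page_bytes ))
}

GIB=$(( 1024 * 1024 * 1024 ))
PAGE=$(( 2 * 1024 * 1024 ))   # 2 MiB page size, as stated in the report

echo "pages for a 20 GiB heap: $(hugepages_needed $(( 20 * GIB )) "$PAGE")"
echo "GiB covered by 1536 pages: $(( 1536 * PAGE / GIB ))"
echo "20 GiB in bytes: $(( 20 * GIB ))"   # compare with the shmmax value
```

Under these assumptions a 20 GiB heap needs 10240 pages, while 1536 pages cover 3 GiB, so the nr_hugepages value recorded above may not match the stated intent; the report does not say which was actually in effect. The shmmax value of 21500000000 is consistent with "over 20GB" (20 GiB is 21474836480 bytes).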

Attachments:

.yaml
fio output - results.tar.gz
50 thread heap dump - https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
100 thread - visual vm anonymous screenshot - visualvm_screenshot
dstat screenshot with compaction - Node_with_compaction.png
dstat screenshot without compaction - Node_without_compaction.png
gcinspector messages from system.log
gc.log output - gc.logs.tar.gz

Observations:

even though this is a spinning disk implementation, disk io looks good.
- output from Jshook perf monitor https://github.com/jshook/perfscripts is attached
- note, we leveraged direct io for all tests by adding direct=1 to the .global config files
cpu usage is moderate until large GC events occur
once old gen heap fills up and cannot be cleaned, memtable post flushers start to back up (tpstats shows many pending)
the node itself (e.g. ssh) is still responsive, but the Cassandra instance becomes unresponsive
once old gen heap fills up, Cassandra stress starts to throw CL ONE errors stating there aren't enough replicas to satisfy....
heap dump from 50 thread, JVM scenario 1 is attached
- appears to show a compaction thread consuming a lot of memory
sample system.log output for gc issues
strace -e futex -p $PID -f -c output during 100 thread load and during old gen "filling", just before full
% time     seconds  usecs/call     calls    errors syscall
100.00  244.886766        4992     49052      7507 futex
100.00  244.886766                 49052      7507 total
htop during full gc cycle - https://s3.amazonaws.com/uploads.hipchat.com/6528/480117/4ZlgcoNScb6kRM2/upload.png
tpstats shows nothing blocked on these nodes
compaction does have pending tasks, upwards of 20, on the nodes
Nodes without compaction achieved approximately 20k writes per second per node without errors or drops
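
The strace summary in the observations above can be sanity-checked with a little arithmetic. The sketch below recomputes the average time per futex call and the share of calls that returned an error from the reported totals; the helper names are hypothetical and nothing here is taken from additional measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that recompute figures from the strace -c summary.

# Average microseconds per call, truncated to an integer.
usecs_per_call() {
  awk -v s="$1" -v c="$2" 'BEGIN { printf "%d\n", s * 1000000 / c }'
}

# Percentage of calls that returned an error, to one decimal place.
error_pct() {
  awk -v e="$1" -v c="$2" 'BEGIN { printf "%.1f\n", e * 100 / c }'
}

usecs_per_call 244.886766 49052   # seconds and calls from the futex row
error_pct 7507 49052              # errors and calls from the futex row
```

This works out to about 4992 microseconds per call, matching the usecs/call column, with roughly 15.3% of futex calls returning an error.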

Next Steps:

Will try to create a flame graph and upload it here - http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
Will try to recreate in another environment

Attachments

output.1.svg | 11/Dec/14 14:40 | 1.33 MB | jonathan lacefield
output.2.svg | 11/Dec/14 14:40 | 1.35 MB | jonathan lacefield
output.svg | 11/Dec/14 12:46 | 771 kB | jonathan lacefield
memtable_debug | 10/Dec/14 19:33 | 1.98 MB | jonathan lacefield
gcinspector_messages.txt | 09/Dec/14 20:03 | 17 kB | jonathan lacefield
visualvm_screenshot | 09/Dec/14 20:03 | 245 kB | jonathan lacefield
results.tar.gz | 09/Dec/14 20:03 | 22 kB | jonathan lacefield
Node_without_compaction.png | 09/Dec/14 20:03 | 957 kB | jonathan lacefield
Node_with_compaction.png | 09/Dec/14 20:03 | 935 kB | jonathan lacefield
gc.logs.tar.gz | 09/Dec/14 20:03 | 781 kB | jonathan lacefield
cassandra.yaml | 09/Dec/14 20:03 | 32 kB | jonathan lacefield

Issue Links

duplicates

CASSANDRA-8485 Move 2.0 metered flusher to its own thread

Resolved
