[CASSANDRA-9549] Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Urgent
Resolution: Fixed
Fix Version/s: 2.1.7
Component/s: None
Labels: None
Environment:


Cassandra 2.1.5. 9-node cluster in EC2 (m1.large nodes, 2 cores, 7.5G memory, 800G platter for Cassandra data; root partition and commit log are on SSD EBS with sufficient IOPS), 3 nodes/availability zone, 1 replica/zone

JVM: /usr/java/jdk1.8.0_40/jre/bin/java
JVM Flags besides CP: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2G -Xmx2G -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid

Kernel: Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP x86_64 x86_64 x86_64 GNU/Linux


Severity: Critical
Since Version: 2.1.5

Description

We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over a couple of days, consumes all of the available JVM heap space and puts the JVM into GC hell: it keeps attempting CMS collections but cannot free any heap. This happens on every node in our cluster and requires rolling Cassandra restarts just to keep the cluster running. We upgraded the cluster from the 2.0 branch a couple of months ago, following the Datastax docs, and had been using the data in this cluster for more than a year without problems.
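For readers who want to watch for the symptom described above, here is a minimal sketch of sampling heap occupancy through the standard `java.lang.management` API. The class name `HeapWatch` and the 90% threshold are illustrative assumptions, not part of the report. The sketch measures the JVM it runs in; observing a Cassandra node this way would mean reading the same MXBean over a remote JMX connection, which is not shown.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapWatch {
    /** Fraction of the heap ceiling currently in use, in the range (0, 1]. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when no maximum is defined; fall back to the committed size.
        long ceiling = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / ceiling;
    }

    public static void main(String[] args) {
        double used = heapUsedFraction();
        System.out.printf("heap used: %.1f%%%n", used * 100);
        // The 0.90 threshold is an arbitrary illustrative value.
        if (used > 0.90) {
            System.out.println("WARNING: heap nearly full");
        }
    }
}
```

A fraction that stays near 1.0 across repeated samples, even after collections, would match the "can't free up any heap space" pattern in the description.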

As the heap fills up with objects that cannot be garbage collected, the CPU/OS load average grows along with it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a 2-day period and watched the number of Node objects go from 4M to 19M to 36M, and eventually to about 65M before the node stopped responding. The screen capture of our heap dump is from the 19M measurement.
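The issue title points at `ConcurrentLinkedQueue.remove`. Below is a minimal sketch of the acquire-then-release churn that exercises that method, next to the same pattern backed by a concurrent hash set. `Tracker` and `churn` are hypothetical names for illustration. This is not Cassandra's `Ref.GlobalState` code, and the set is only one possible alternative, not necessarily the fix that was committed.

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RefTrackingSketch {
    /** Hypothetical tracker: records live references and forgets them on release. */
    static final class Tracker {
        private final Collection<Object> live;

        Tracker(Collection<Object> backing) {
            this.live = backing;
        }

        Object acquire() {
            Object ref = new Object();
            live.add(ref);
            return ref;
        }

        void release(Object ref) {
            // For a ConcurrentLinkedQueue this is a linear scan from the head;
            // for a hash-based set it is an expected constant-time removal.
            live.remove(ref);
        }

        int liveCount() {
            return live.size();
        }
    }

    /** Acquires and immediately releases a reference, many times over. */
    static int churn(Tracker tracker, int iterations) {
        for (int i = 0; i < iterations; i++) {
            Object ref = tracker.acquire();
            tracker.release(ref);
        }
        return tracker.liveCount();
    }

    public static void main(String[] args) {
        Tracker queueBacked = new Tracker(new ConcurrentLinkedQueue<>());
        Tracker setBacked = new Tracker(ConcurrentHashMap.newKeySet());
        System.out.println("queue-backed live refs: " + churn(queueBacked, 100_000));
        System.out.println("set-backed live refs:   " + churn(setBacked, 100_000));
    }
}
```

Both trackers end with a logical size of zero. Whether the queue retains internal `Node` objects after such churn depends on the JDK in use and would only show up in a heap dump, as in the measurements above.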

Load on the cluster is minimal. We can see this effect even with only a handful of writes per second (see the attachments for Opscenter snapshots during very light and heavier loads). Even with only 5 reads per second we see this behavior.

Log files show repeated errors in Ref.java:181 and Ref.java:279 and "LEAK detected" messages:

ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]

ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected
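To gauge how often messages like the one above occur in a log such as the attached system.log, the following is a minimal sketch that extracts the leaked class and resource path from "LEAK DETECTED" lines. The regular expression is inferred from the single sample line in this report and is an assumption. Other Cassandra versions may word the message differently. `LeakLogScan` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeakLogScan {
    // Groups: 1 = reference state, 2 = leaked class, 3 = identity hash, 4 = resource path.
    // Pattern derived from the one sample line quoted in the report (an assumption).
    private static final Pattern LEAK = Pattern.compile(
        "LEAK DETECTED: a reference \\(([^)]+)\\) to class ([^@\\s]+)@(\\d+):(\\S+) was not released");

    /** Returns the leaked resource path from a log line, or null if the line does not match. */
    static String leakedResource(String line) {
        Matcher m = LEAK.matcher(line);
        return m.find() ? m.group(4) : null;
    }

    /** Counts matching lines per leaked class name, in first-seen order. */
    static Map<String, Integer> countLeaksByClass(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LEAK.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - "
            + "LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) "
            + "to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:"
            + "/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected";
        System.out.println(leakedResource(sample));
        System.out.println(countLeaksByClass(Arrays.asList(sample, "INFO unrelated line")));
    }
}
```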

This might be related to CASSANDRA-8723?

Attachments

c4_system.log, 8.50 MB, 04/Jun/15 20:36, Ivar Thorson
c7fromboot.zip, 5.66 MB, 05/Jun/15 16:47, Ivar Thorson
cassandra.yaml, 35 kB, 04/Jun/15 17:51, Ivar Thorson
cpu-load.png, 57 kB, 04/Jun/15 17:41, Ivar Thorson
memoryuse.png, 47 kB, 04/Jun/15 17:41, Ivar Thorson
ref-java-errors.jpeg, 64 kB, 04/Jun/15 20:35, Ivar Thorson
suspect.png, 107 kB, 04/Jun/15 17:41, Ivar Thorson
two-loads.png, 204 kB, 04/Jun/15 20:02, Ivar Thorson

Issue Links

is duplicated by: CASSANDRA-10548 OOM in Ref#GlobalState (Resolved)


People

Assignee: Benedict Elliott Smith

Reporter: Ivar Thorson

Authors: Benedict Elliott Smith

Reviewers: Marcus Eriksson

Votes: 2

Watchers: 22

Dates

Created: 04/Jun/15 17:41

Updated: 16/Apr/19 09:31

Resolved: 17/Jun/15 16:15
