Cassandra / CASSANDRA-9549

Memory leak in Ref.GlobalState due to pathological ConcurrentLinkedQueue.remove behaviour


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Urgent
    • Resolution: Fixed
    • Fix Version/s: 2.1.7
    • None
    • None
    • Severity: Critical

    Description

      We have been experiencing a severe memory leak with Cassandra 2.1.5 that, over a couple of days, consumes all of the available JVM heap space, putting the JVM into GC hell where it keeps attempting CMS collections but cannot free any heap space. This pattern occurs on every node in our cluster and requires rolling Cassandra restarts just to keep the cluster running. We upgraded the cluster from the 2.0 branch a couple of months ago, following the DataStax docs, and have been using the data from this cluster for more than a year without problems.

      As the heap fills up with non-GC-able objects, the CPU/OS load average grows along with it. Heap dumps reveal an increasing number of java.util.concurrent.ConcurrentLinkedQueue$Node objects. We took heap dumps over a two-day period and watched the number of Node objects go from 4M to 19M to 36M, and eventually to about 65M before the node stopped responding. The screen capture of our heap dump is from the 19M measurement.
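
The issue title attributes the Node growth to ConcurrentLinkedQueue.remove. As a rough illustration of the kind of access pattern that exercises that path, the sketch below keeps one long-lived element in a queue and repeatedly adds and removes a short-lived one. This is not Cassandra's Ref.GlobalState code; the class name ClqRemovePattern and the method churn are hypothetical, and whether removed nodes are actually retained depends on the JDK's ConcurrentLinkedQueue implementation. The sketch only times the pattern so that the behaviour can be observed on a given JVM.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class ClqRemovePattern {
    /**
     * Repeatedly adds a short-lived element at the tail and removes it again.
     * Returns the elapsed time in nanoseconds for the given number of rounds.
     */
    public static long churn(ConcurrentLinkedQueue<Object> queue, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            Object shortLived = new Object();
            queue.add(shortLived);     // appended at the tail
            queue.remove(shortLived);  // remove(Object) scans from the head
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<Object> queue = new ConcurrentLinkedQueue<>();
        queue.add("long-lived"); // stays at the head for the whole run
        for (int batch = 1; batch <= 5; batch++) {
            long nanos = churn(queue, 10_000);
            System.out.printf("batch %d: %d ms, size=%d%n",
                    batch, nanos / 1_000_000, queue.size());
        }
    }
}
```

If the batch times keep growing while the reported size stays at 1, that would suggest removed nodes are still being traversed; a heap histogram filtered for ConcurrentLinkedQueue$Node would be the way to confirm it.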

      Load on the cluster is minimal. We can see this effect even with only a handful of writes per second (see attachments for OpsCenter snapshots during very light loads and heavier loads). Even with only 5 reads per second we see this behavior.

      Log files show repeated errors in Ref.java:181 and Ref.java:279 and "LEAK detected" messages:

      ERROR [CompactionExecutor:557] 2015-06-01 18:27:36,978 Ref.java:279 - Error when closing class org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier@1302301946:/data1/data/ourtablegoeshere-ka-1150
      java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@32680b31 rejected from org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor@573464d6[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1644]
      
      ERROR [Reference-Reaper:1] 2015-06-01 18:27:37,083 Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@74b5df92) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@2054303604:/data2/data/ourtablegoeshere-ka-1151 was not released before the reference was garbage collected
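
To see how often this happens on a node, the "LEAK DETECTED" lines can be counted in system.log. The following is a minimal sketch that assumes the log lines follow the format quoted above; the class name LeakLogScan and its methods are hypothetical helpers, not part of Cassandra.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeakLogScan {
    // Matches "LEAK DETECTED" lines in the format quoted above. Group 3 captures
    // the resource path that follows the "@<hash>:" part of the tidier class.
    private static final Pattern LEAK = Pattern.compile(
        "LEAK DETECTED: a reference \\((\\S+)\\) to class (\\S+?)@[0-9a-fA-F]+:(\\S+) was not released");

    /** Returns the leaked resource path, or null if the line is not a leak report. */
    public static String leakedPath(String line) {
        Matcher m = LEAK.matcher(line);
        return m.find() ? m.group(3) : null;
    }

    /** Counts the lines that are leak reports. */
    public static long countLeaks(List<String> lines) {
        return lines.stream().filter(l -> leakedPath(l) != null).count();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java LeakLogScan /path/to/system.log
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("LEAK DETECTED lines: " + countLeaks(lines));
    }
}
```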
      

      This might be related to CASSANDRA-8723?

      Attachments

        1. c4_system.log (8.50 MB, Ivar Thorson)
        2. c7fromboot.zip (5.66 MB, Ivar Thorson)
        3. cassandra.yaml (35 kB, Ivar Thorson)
        4. cpu-load.png (57 kB, Ivar Thorson)
        5. memoryuse.png (47 kB, Ivar Thorson)
        6. ref-java-errors.jpeg (64 kB, Ivar Thorson)
        7. suspect.png (107 kB, Ivar Thorson)
        8. two-loads.png (204 kB, Ivar Thorson)


          People

            Assignee: Benedict Elliott Smith
            Reporter: Ivar Thorson
            Authors: Benedict Elliott Smith
            Reviewers: Marcus Eriksson
            Votes: 2
            Watchers: 22
