CASSANDRA-19785

Possible memory leak in BTree.FastBuilder


Details

    Description

      We are having a problem with the heap growing in size. This is a large cluster (> 1,000 nodes) spread across a large number of DCs, running version 4.0.11.

       

      Each node has a 32GB heap, and the amount used continues to grow until it reaches 30GB. The node then struggles with multiple Full GC pauses, as can be seen here:

      We took 2 heap dumps on one node a few days after it was restarted, and the heap had grown by 2.7GB between them.

       

      9th July

      11th July

      This is mainly an increase in the memory used by FastThreadLocalThread, which grew from 5.92GB to 8.53GB.

      Looking deeper, the growing heap is contained within the threads for the MutationStage, Native-transport-Requests, ReadStage etc. We would expect the memory used within these threads to be short-lived, and not to grow as time goes on. We recently increased the size of these thread pools, and that has increased the size of the problem.

       

      Top memory usage for FastThreadLocalThread

      9th July

      11th July


      This has led us to investigate whether there could be a memory leak, and we have found the following issues with retained references in BTree.FastBuilder objects. The issue appears to stem from the reset() method, which does not properly clear all buffers. We are not really sure how BTree.FastBuilder works, but this is our analysis of where a leak might occur.

       

      Specifically:

      Leaf Buffer Not Being Cleared:
      When leaf().count is 0, the statement Arrays.fill(leaf().buffer, 0, leaf().count, null); does not clear the buffer because the end index is 0. This leaves the buffer with references to potentially large objects, preventing garbage collection and increasing heap usage.

      Branch inUse Property:
      If the inUse property of the branch is set to false elsewhere in the code, the while loop while (branch != null && branch.inUse) does not execute, resulting in uncleared branch buffers and retained references.
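
      The two cases above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Cassandra's BTree.FastBuilder: the names ResetSketch, Node, drainedLeaf and retained are hypothetical, and reset() mirrors only the two statements quoted in this report (the body of the while loop is an assumption). It shows that Arrays.fill with an end index of 0 touches nothing, and that a loop guarded by inUse is skipped once the flag is false.

```java
import java.util.Arrays;

// Hypothetical, simplified model; NOT the real BTree.FastBuilder.
class ResetSketch {
    static class Node {
        final Object[] buffer = new Object[4];
        int count;      // number of live entries, as tracked by the builder
        boolean inUse;  // only meaningful for branches in this model
        Node parent;    // next branch up the chain, or null
    }

    // Mirrors the two statements quoted in the report.
    static void reset(Node leaf) {
        // With count == 0 the range [0, 0) is empty, so nothing is nulled.
        Arrays.fill(leaf.buffer, 0, leaf.count, null);
        Node branch = leaf.parent;
        // If inUse was already cleared elsewhere, the body never runs.
        while (branch != null && branch.inUse) {
            // Assumed loop body: clear the branch and move up the chain.
            Arrays.fill(branch.buffer, null);
            branch.count = 0;
            branch.inUse = false;
            branch = branch.parent;
        }
    }

    // Counts non-null slots, i.e. references that keep objects reachable.
    static int retained(Node n) {
        int live = 0;
        for (Object o : n.buffer) {
            if (o != null) live++;
        }
        return live;
    }

    // Builds a leaf/branch pair in the state the report describes:
    // count == 0 and inUse == false, but the buffers are still populated.
    static Node drainedLeaf() {
        Node branch = new Node();
        branch.buffer[0] = new byte[1024];
        branch.inUse = false;
        Node leaf = new Node();
        leaf.parent = branch;
        leaf.buffer[0] = new byte[1024];
        leaf.buffer[1] = new byte[1024];
        leaf.count = 0;
        return leaf;
    }

    public static void main(String[] args) {
        Node leaf = drainedLeaf();
        reset(leaf);
        System.out.println("leaf slots still referenced: " + retained(leaf));
        System.out.println("branch slots still referenced: " + retained(leaf.parent));
    }
}
```

      In this model both buffers keep their references after reset(), which is the retention pattern we believe we are seeing in the heap dumps.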

       

      This is based on the following observations:

          Heap Dumps: Analysis of heap dumps shows that leaf().count is often 0, and as a result, the buffer is not being cleared, leading to high heap utilization.

          Remote Debugging: Debugging sessions indicate that the drain() method sets count to 0, and the inUse flag for the parent branch is set to false, preventing the while loop in reset() from clearing the branch buffers.
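
          The ordering seen in the debugger (drain() first, then reset()) can be reproduced in isolation. The sketch below is a hypothetical model, not Cassandra code: DrainThenResetSketch, add, resetByCount and resetWholeBuffer are names invented for illustration, and the behaviour of drain() is assumed from the observation above. The whole-buffer variant is included only for comparison and is not a proposed patch.

```java
import java.util.Arrays;

// Hypothetical model of the drain() -> reset() ordering; NOT Cassandra code.
class DrainThenResetSketch {
    final Object[] leafBuffer = new Object[8];
    int leafCount;

    void add(Object value) {
        leafBuffer[leafCount++] = value;
    }

    // Assumed behaviour, per the debugging observation: drain() hands the
    // contents off and sets count to 0 without nulling the slots.
    Object[] drain() {
        Object[] out = Arrays.copyOf(leafBuffer, leafCount);
        leafCount = 0;
        return out;
    }

    // Reset as quoted in the report: clears only the range [0, count).
    void resetByCount() {
        Arrays.fill(leafBuffer, 0, leafCount, null);
        leafCount = 0;
    }

    // Comparison variant: clears every slot regardless of count.
    void resetWholeBuffer() {
        Arrays.fill(leafBuffer, null);
        leafCount = 0;
    }

    // Counts slots that still hold a reference.
    int retained() {
        int live = 0;
        for (Object o : leafBuffer) {
            if (o != null) live++;
        }
        return live;
    }

    public static void main(String[] args) {
        DrainThenResetSketch b = new DrainThenResetSketch();
        b.add("row-1");
        b.add("row-2");
        b.add("row-3");
        b.drain();
        b.resetByCount();
        System.out.println("after count-based reset: " + b.retained());
        b.resetWholeBuffer();
        System.out.println("after whole-buffer reset: " + b.retained());
    }
}
```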

       

      Attachments

        1. image-2024-07-19-08-47-34-582.png
          321 kB
          Paul Chandler
        2. image-2024-07-19-08-47-19-517.png
          311 kB
          Paul Chandler
        3. image-2024-07-19-08-46-56-594.png
          540 kB
          Paul Chandler
        4. image-2024-07-19-08-46-42-979.png
          531 kB
          Paul Chandler
        5. image-2024-07-19-08-46-06-919.png
          148 kB
          Paul Chandler
        6. image-2024-07-19-08-45-50-383.png
          149 kB
          Paul Chandler
        7. image-2024-07-19-08-45-33-933.png
          28 kB
          Paul Chandler
        8. image-2024-07-19-08-45-17-289.png
          29 kB
          Paul Chandler
        9. image-2024-07-19-08-44-56-714.png
          83 kB
          Paul Chandler


            People

              benedict Benedict Elliott Smith
              paulchandler Paul Chandler
              Benedict Elliott Smith
              Branimir Lambov
              Votes: 0
              Watchers: 7


                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m