Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 0.6
    • Component/s: Core
    • Labels:
      None

      Description

      Running 0.5.

      We had a node reboot, and when it came back up, it was unable to replay the commit logs, it would OOM every time.

      We upped the max heap to 6 gigs, but it didn't help.

      I have a heap dump and have it opened in Eclipse MAT.

      Anything specific I should pull out?

      Class Name | Shallow Heap | Retained Heap | Percentage
      ----------------------------------------------------------------------------------------------------------------------------------------

         

      org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor @ 0x7fad06454f48 | 112 | 2,604,583,312 | 84.35%

      • java.util.concurrent.LinkedBlockingQueue @ 0x7fad06405b78
      80 2,604,579,864 84.35%
      • java.util.HashSet @ 0x7fad071b2110
      24 2,952 0.00%
      • java.lang.String @ 0x7fad071b3588 org.apache.cassandra.concurrent:type=ROW-MUTATION-STAGE
      40 176 0.00%
      • org.apache.cassandra.concurrent.NamedThreadFactory @ 0x7fad0714e998
      32 56 0.00%
      • java.util.concurrent.locks.ReentrantLock$NonfairSync @ 0x7fad0722b570
      48 48 0.00%
      • java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject @ 0x7fad071b3560
      40 40 0.00%
      • java.util.concurrent.atomic.AtomicInteger @ 0x7fad071b20e0
      24 24 0.00%
      • java.util.concurrent.locks.ReentrantLock @ 0x7fad071b20f8
      24 24 0.00%
      • java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy @ 0x7fad071b2128
      16 16 0.00%
      '- Total: 9 entries
         

      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad063b3390 | 104 | 95,871,520 | 3.10%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad063b3460 | 104 | 82,056,536 | 2.66%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad063b3188 | 104 | 72,894,176 | 2.36%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad06331798 | 104 | 45,394,360 | 1.47%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad06331660 | 104 | 33,895,560 | 1.10%
      java.lang.Thread @ 0x7fad060e2320 main Native Stack, Thread | 168 | 33,573,328 | 1.09%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad06287e50 | 104 | 33,528,680 | 1.09%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad063b3530 | 104 | 33,423,720 | 1.08%
      org.apache.cassandra.db.ColumnFamilyStore @ 0x7fad063b32c0 | 104 | 20,221,624 | 0.65%
      org.apache.cassandra.db.Memtable @ 0x7fad06410120 | 96 | 9,383,880 | 0.30%
      org.apache.cassandra.db.ColumnFamily @ 0x7fad33df3690 | 80 | 1,190,016 | 0.04%
      org.apache.cassandra.io.SSTableReader @ 0x7fad0653a4a0 | 72 | 587,008 | 0.02%
      org.apache.cassandra.io.SSTableReader @ 0x7fad06c41ef0 | 72 | 543,552 | 0.02%
      org.apache.cassandra.io.SSTableReader @ 0x7fad06456450 | 72 | 541,320 | 0.02%
      Total: 15 of 49,239 entries | | |
      ----------------------------------------------------------------------------------------------------------------------------------------

      1. 885-v3.txt
        4 kB
        Jonathan Ellis

        Activity

        Hide
        Ryan King added a comment -

        Did you have a large number of hints on this node?

        Show
        Ryan King added a comment - Did you have a large number of hints on this node?
        Hide
        Paul Querna added a comment -

        It should not of had hinted handoff, it was the only machine that was down at this time.

        Show
        Paul Querna added a comment - It should not of had hinted handoff, it was the only machine that was down at this time.
        Hide
        Brandon Williams added a comment -

        Perhaps this is related to CASSANDRA-896? I think Paul mentioned on irc that most of his heap was in LinkedBlockingQueue.

        Show
        Brandon Williams added a comment - Perhaps this is related to CASSANDRA-896 ? I think Paul mentioned on irc that most of his heap was in LinkedBlockingQueue.
        Hide
        Jonathan Ellis added a comment -

        it should definitely do a full GC before OOMing, though. the jvm is quite thorough about that (hence the gc storm behavior as you approach the limit)

        Show
        Jonathan Ellis added a comment - it should definitely do a full GC before OOMing, though. the jvm is quite thorough about that (hence the gc storm behavior as you approach the limit)
        Hide
        Jonathan Ellis added a comment -

        If I'm reading the MAT thing correctly, it's saying 85% of the ram is used by the ROW-MUTATION executor queue, meaning that commitlog recovery was reading mutations + shoving them in the queue faster than they could be consumed and applied.

        Applying the stage limits from the first patch of CASSANDRA-685 should fix this, although the rest of 685 will have to wait for 0.7.

        Show
        Jonathan Ellis added a comment - If I'm reading the MAT thing correctly, it's saying 85% of the ram is used by the ROW-MUTATION executor queue, meaning that commitlog recovery was reading mutations + shoving them in the queue faster than they could be consumed and applied. Applying the stage limits from the first patch of CASSANDRA-685 should fix this, although the rest of 685 will have to wait for 0.7.
        Hide
        Jonathan Ellis added a comment -

        attached.

        Show
        Jonathan Ellis added a comment - attached.
        Hide
        Jonathan Ellis added a comment -

        v2 fixes buggy assert

        Show
        Jonathan Ellis added a comment - v2 fixes buggy assert
        Hide
        Jonathan Ellis added a comment -

        v3 sets minimum of 2 threads on read, write, and response threads

        Show
        Jonathan Ellis added a comment - v3 sets minimum of 2 threads on read, write, and response threads
        Hide
        Brandon Williams added a comment -

        +1, performance is still good.

        Show
        Brandon Williams added a comment - +1, performance is still good.
        Hide
        Jonathan Ellis added a comment -

        committed

        Show
        Jonathan Ellis added a comment - committed

          People

          • Assignee:
            Jonathan Ellis
            Reporter:
            Paul Querna
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development