Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-13687

Abnormal heap growth and CPU usage during repair.

    XMLWordPrintableJSON

Details

    • Normal

    Description

      We recently upgraded from 3.0.9 to 3.0.14 to get the fix from CASSANDRA-13004

      Sadly 3 out of the last 7 nights we have had to wake up due Cassandra dying on us. We currently don't have any data to help reproduce this, but maybe since there aren't many commits between the 2 versions it might be obvious.

      Basically we trigger a parallel incremental repair from a single node every night at 1AM. That node will sometimes start allocating a lot and keeping the heap maxed and triggering GC. Some of these GC can last up to 2 minutes. This effectively destroys the whole cluster due to timeouts to this node.

      The only solution we currently have is to drain the node and restart the repair, it has worked fine the second time every time.

      I attached heap charts from 3.0.9 and 3.0.14 during repair.

      Attachments

        1. 3.0.14cpu.png
          33 kB
          Stanislav Vishnevskiy
        2. 3.0.14heap.png
          127 kB
          Stanislav Vishnevskiy
        3. 3.0.9heap.png
          132 kB
          Stanislav Vishnevskiy
        4. 3.0.14.png
          27 kB
          Stanislav Vishnevskiy
        5. 3.0.9.png
          23 kB
          Stanislav Vishnevskiy

        Activity

          People

            Unassigned Unassigned
            stanislav Stanislav Vishnevskiy
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated: