[CASSANDRA-13687] Abnormal heap growth and CPU usage during repair. - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: None
Component/s: Legacy/Streaming and Messaging
Labels: None
Severity: Normal

Description

We recently upgraded from 3.0.9 to 3.0.14 to get the fix from ~~CASSANDRA-13004~~.

Unfortunately, on 3 of the last 7 nights we have been woken up because Cassandra died on us. We currently don't have any data to help reproduce this, but since there aren't many commits between the two versions, the cause may be obvious.

Every night at 1 AM we trigger a parallel incremental repair from a single node. That node sometimes starts allocating heavily, which keeps the heap maxed out and triggers repeated garbage collections. Some of these GC pauses last up to 2 minutes. Because requests to this node time out, this effectively takes down the whole cluster.

The only workaround we currently have is to drain the node and restart the repair. The second attempt has succeeded every time.

I have attached heap charts from 3.0.9 and 3.0.14, both captured during repair.

Attachments

- 3.0.14cpu.png (33 kB), uploaded 12/Jul/17 08:55 by Stanislav Vishnevskiy
- 3.0.14heap.png (127 kB), uploaded 12/Jul/17 08:52 by Stanislav Vishnevskiy
- 3.0.9heap.png (132 kB), uploaded 12/Jul/17 08:52 by Stanislav Vishnevskiy
- 3.0.14.png (27 kB), uploaded 12/Jul/17 05:37 by Stanislav Vishnevskiy
- 3.0.9.png (23 kB), uploaded 12/Jul/17 05:36 by Stanislav Vishnevskiy

People

Assignee: Unassigned
Reporter: Stanislav Vishnevskiy
Votes: 0
Watchers: 7

Dates

Created: 12/Jul/17 05:38
Updated: 16/Apr/19 09:30