We ran drivers 3-days endurance tests against Cassandra 2.0.11 and C* crashed with an OOME. This happened both with ruby-driver 1.0-beta and java-driver 2.0.8-snapshot.
Attached are :
|OOME_node_system.log||The system.log of one Cassandra node that crashed|
|gc.log.gz||The GC log on the same node|
|heap-usage-after-gc.png||The heap occupancy evolution after every GC cycle|
|heap-usage-after-gc-zoom.png||A focus on when things start to go wrong|
Our test executes 5 CQL statements (select, insert, select, delete, select) for a given unique id, during 3 days, using multiple threads. There is not change in the workload during the test.
In the attached log, it seems something starts in Cassandra between 2014-11-06 10:29:22 and 2014-11-06 10:45:32. This causes an allocation that fills the heap. We eventually get stuck in a Full GC storm and get an OOME in the logs.
I have run the java-driver tests against Cassandra 1.2.19 and 2.1.1. The error does not occur. It seems specific to 2.0.11.
- is broken by
CASSANDRA-6998 HintedHandoff - expired hints may block future hints deliveries
- is related to
CASSANDRA-8164 OOM due to slow memory meter