[CASSANDRA-13096] Snapshots slow down jmx scraping - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Not A Problem
Fix Version/s: None
Component/s: Observability/Metrics
Labels: None
Severity: Normal

Description

Hello,

We are scraping the JMX metrics through a Prometheus exporter, and we noticed that some nodes became very slow to answer (more than 20 seconds). After some investigation, we did not find any hardware problem or overload on these "slow" nodes. It happens on different clusters, some with only a few gigabytes of data, and it does not seem to be related to a specific version either, as it happens on 2.1, 2.2 and 3.0 nodes.

After several unsuccessful attempts, one of our ideas was to clear the snapshots remaining on one problematic node:

nodetool clearsnapshot

And the magic happened: as you can see in the attached diagrams, the second we cleared the snapshots, the CPU activity dropped immediately and the time to scrape the JMX metrics went from more than 20 seconds to nearly instantaneous.

Can you enlighten us on this issue? Once again, it appears on all three of our versions (2.1, 2.2 and 3.0) and with different data volumes, and it is not systematically linked to the snapshots, as we have some nodes with the same volume of snapshots that are doing fine.
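
One way to check whether snapshot contents differ between a slow node and a healthy one is to compare the number of files, not only the bytes, under each snapshot directory. The sketch below is a hypothetical helper, not part of Cassandra or nodetool. It assumes the usual on-disk layout `<data_dir>/<keyspace>/<table>/snapshots/<name>/` and a data directory of `/var/lib/cassandra/data`, both of which may differ on your nodes.

```python
import os
import sys


def snapshot_stats(data_dir):
    """Return {snapshot_path: (file_count, total_bytes)} for each
    directory found directly under a 'snapshots' directory.

    Sizes are apparent sizes: snapshot files are normally hard links
    to SSTables, so this can overstate the disk space actually used.
    """
    stats = {}
    for root, dirs, _files in os.walk(data_dir):
        if os.path.basename(root) != "snapshots":
            continue
        for name in dirs:
            snap = os.path.join(root, name)
            count = 0
            size = 0
            for sub_root, _sub_dirs, sub_files in os.walk(snap):
                for f in sub_files:
                    count += 1
                    size += os.lstat(os.path.join(sub_root, f)).st_size
            stats[snap] = (count, size)
        # Snapshots were already walked above; do not descend again.
        dirs[:] = []
    return stats


if __name__ == "__main__":
    # Assumed default data directory; pass another path as the first argument.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra/data"
    for path, (count, size) in sorted(snapshot_stats(data_dir).items()):
        print(f"{count:>8} files {size:>14} bytes  {path}")
```

Running it on a slow node and on a healthy node with a similar snapshot volume makes the file counts comparable side by side. It only gathers evidence and does not by itself explain the slowdown.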

Attachments

- JMX Scrape Duration.png (149 kB, 04/Jan/17 17:07, Maxime Fouilleul)
- CPU Load.png (300 kB, 04/Jan/17 17:07, Maxime Fouilleul)
- Clear Snapshots.png (148 kB, 04/Jan/17 17:07, Maxime Fouilleul)

Issue Links

relates to: CASSANDRA-16843 List snapshots of dropped tables (Resolved)

People

Assignee: Unassigned

Reporter: Maxime Fouilleul

Votes: 0

Watchers: 6

Dates

Created: 04/Jan/17 17:04

Updated: 28/Apr/22 19:57

Resolved: 14/Dec/20 16:02