[CASSANDRA-10245] Provide after the fact visibility into the reliability of the environment C* operates in - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: 5.x
Component/s: Legacy/Observability
Labels: None

Description

I think that, by default, databases should not depend entirely on operator-provided tools for monitoring node and network health.

The database should be able to detect and report on several dimensions of performance in its environment, and more specifically report on deviations from acceptable performance.

Node-wide pauses
JVM-wide pauses
Latency and round-trip time to all endpoints
Block device IO latency
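As an illustration of the last item in the list above, block device IO latency could be sampled by timing a small synced write. This is only a sketch under stated assumptions: `DiskLatencyProbe` and `timeSyncedWriteNanos` are hypothetical names, not existing Cassandra code, and the 4 KiB probe size is arbitrary. The probe directory would have to live on the device being monitored (for example a data or commit log directory) for the number to mean anything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch; not part of Cassandra. */
final class DiskLatencyProbe {
    /**
     * Writes one 4 KiB block to a temporary file in {@code dir}, forces it
     * to the storage device, and returns the elapsed time in nanoseconds.
     */
    static long timeSyncedWriteNanos(Path dir) {
        Path file = null;
        try {
            file = Files.createTempFile(dir, "io-probe", ".tmp");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                ByteBuffer block = ByteBuffer.allocate(4096); // assumed probe size
                long start = System.nanoTime();
                while (block.hasRemaining()) {
                    channel.write(block);
                }
                channel.force(true); // flush file content and metadata
                return System.nanoTime() - start;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // Best-effort cleanup of the probe file.
                }
            }
        }
    }
}
```

A background task could call this periodically and log only the samples that exceed some threshold, which would keep the output small enough for post-hoc analysis.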

If Flight Recorder were available for use in production, I would say, as a start, just turn that on, add jHiccup (inside and outside the server process), and add a daemon inside the server to measure network performance between endpoints.
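The jHiccup idea mentioned above can be approximated with a thread that asks to sleep for a short fixed interval and records how much later than requested it wakes up. The following is a minimal sketch of that technique, not jHiccup's actual implementation; `PauseMonitor`, its method names, the 1 ms interval, and the 100 ms threshold are all assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch; not part of Cassandra or jHiccup. */
final class PauseMonitor {
    /**
     * Parks the current thread for roughly {@code intervalNanos} and returns
     * how much longer than requested the wake-up took, in nanoseconds.
     */
    static long measureOvershootNanos(long intervalNanos) {
        long start = System.nanoTime();
        LockSupport.parkNanos(intervalNanos);
        long elapsed = System.nanoTime() - start;
        // parkNanos may return early, so clamp the overshoot at zero.
        return Math.max(0L, elapsed - intervalNanos);
    }

    /** Takes {@code samples} measurements and returns the worst overshoot seen. */
    static long worstOvershootNanos(int samples, long intervalNanos) {
        long worst = 0L;
        for (int i = 0; i < samples; i++) {
            worst = Math.max(worst, measureOvershootNanos(intervalNanos));
        }
        return worst;
    }

    public static void main(String[] args) {
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(1);    // assumed sampling interval
        long thresholdNanos = TimeUnit.MILLISECONDS.toNanos(100); // assumed reporting threshold
        long worst = worstOvershootNanos(1000, intervalNanos);
        if (worst > thresholdNanos) {
            System.out.printf("pause of %d ms observed%n", TimeUnit.NANOSECONDS.toMillis(worst));
        }
    }
}
```

Run inside the server, a large overshoot cannot by itself tell a JVM-wide pause from a node-wide one. Running the same loop in a separate process on the same host, as suggested above, would help separate the two: a stall seen only inside the server points at the JVM, while a stall seen in both points at the node.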

Flight Recorder (FR) is not available (it requires a license in production), so instead we should focus on adding instrumentation for the facets of Flight Recorder that are most useful in diagnosing performance issues. I think we can get pretty far, because what we need to do is not quite as undirected as the exploration that FR and JMC facilitate.

Until we dial in how we measure and how to signal without false positives, I would expect this kind of logging to stay in the background for post-hoc analysis.

People

Assignee: Unassigned

Reporter: Ariel Weisberg

Votes: 0

Watchers: 6

Dates

Created: 01/Sep/15 21:30

Updated: 07/Mar/23 10:54