Details
- Type: Sub-task
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
HDDS-816 added a JSON file in the OM to store persisted metrics such as key count. That Jira has an attached doc that compares several options and concludes that periodically flushing to a JSON file is the best approach. However, it overlooks several problems with saving metrics this way:
- Error handling was missed. See HDDS-10094
- OMs' metrics can diverge if OMs are restarted at different times between flushes of the file.
- On snapshot install on a follower, the metric is reset to the estimated row count anyway. That follower's metrics will then have diverged from the other OMs.
- When metrics for various OMs diverge, they will show different lines in dashboarding applications like Grafana, which may be confusing for users.
- Restoring the metric to a correct value after bugs like HDDS-10063 requires some sort of manual repair.
- Once metrics diverge between OMs, even a restart will not bring them back in sync.
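The divergence described in the list above can be illustrated with a deliberately simplified toy model. This is not Ozone code: `FlushedMetricOm` and `DivergenceDemo` are hypothetical names, the "file" is just a field, and the model assumes a restart reloads exactly the last flushed value.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of one OM whose key-count metric is persisted only on periodic flushes. */
final class FlushedMetricOm {
    private final AtomicLong numKeys = new AtomicLong(); // in-memory metric
    private long persisted;                              // stand-in for the metrics file

    void createKey() { numKeys.incrementAndGet(); }

    /** Periodic flush: copy the in-memory value to the persisted copy. */
    void flush() { persisted = numKeys.get(); }

    /** Restart: in-memory state is lost and reloaded from the last flush. */
    void restart() { numKeys.set(persisted); }

    long numKeys() { return numKeys.get(); }
}

final class DivergenceDemo {
    /** Returns {om1 metric, om2 metric} after the scenario below. */
    static long[] simulate() {
        FlushedMetricOm om1 = new FlushedMetricOm();
        FlushedMetricOm om2 = new FlushedMetricOm();

        // Both OMs apply the same 10 key creations, then flush.
        for (int i = 0; i < 10; i++) { om1.createKey(); om2.createKey(); }
        om1.flush(); om2.flush();

        // 5 more creations are applied on both, but no flush happens yet.
        for (int i = 0; i < 5; i++) { om1.createKey(); om2.createKey(); }

        // Only om2 restarts, so it falls back to the last flushed value.
        om2.restart();

        // Later flushes and restarts preserve the gap instead of closing it.
        om1.createKey(); om2.createKey();
        om1.flush(); om2.flush();
        om1.restart(); om2.restart();

        return new long[] { om1.numKeys(), om2.numKeys() };
    }

    public static void main(String[] args) {
        long[] result = simulate();
        System.out.println("om1=" + result[0] + " om2=" + result[1]); // om1=16 om2=11
    }
}
```

In this model both OMs applied the same 16 key creations, yet they report different counts, and the second round of flush and restart only persists the gap.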
HDDS-1829 later added the ability for some metrics to be updated based on RocksDB key count estimates (see "How to know the number of keys stored in a RocksDB database?" in the RocksDB FAQ). These metrics survive a restart by using the key count estimate and do not use the metrics JSON file, so we now have two divergent implementations. However, once these metrics are set on startup, they are not incremented as new OM operations come in.
This Jira proposes:
- Get rid of the OM metrics JSON file.
- Use key count estimates for all metrics that must survive a restart.
- Continue to update these metrics as OM requests come in.
While the RocksDB estimated key count will not be perfectly accurate, neither is the JSON-based approach. The RocksDB approach is easier to maintain, both in the amount of code required and in fixing metric-counting bugs.
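A minimal sketch of the proposed behavior, under stated assumptions: `EstimateSeededCounter` is a hypothetical class, not an existing Ozone type, and the `LongSupplier` stands in for whatever reads the estimate. In RocksJava that would presumably wrap a read of the `rocksdb.estimate-num-keys` property, but the wiring is an assumption here.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical metric seeded from a key-count estimate, then kept current in memory. */
final class EstimateSeededCounter {
    private final AtomicLong value = new AtomicLong();
    private final LongSupplier estimate; // assumed to return the DB's estimated key count

    EstimateSeededCounter(LongSupplier estimate) {
        this.estimate = estimate;
        reseed(); // on startup there is no file to read, only the estimate
    }

    /** Adopt the current estimate, e.g. on startup or after a snapshot install. */
    void reseed() { value.set(Math.max(0L, estimate.getAsLong())); }

    /** Called as OM requests that add a key are applied. */
    void increment() { value.incrementAndGet(); }

    /** Called as OM requests that remove a key are applied; never goes below zero. */
    void decrement() { value.updateAndGet(v -> Math.max(0L, v - 1)); }

    long get() { return value.get(); }
}

final class EstimateSeededCounterDemo {
    public static void main(String[] args) {
        AtomicLong fakeEstimate = new AtomicLong(100); // stand-in for the DB estimate
        EstimateSeededCounter numKeys = new EstimateSeededCounter(fakeEstimate::get);

        numKeys.increment();
        numKeys.increment();
        numKeys.decrement();
        System.out.println(numKeys.get()); // 101

        // A restart or snapshot install simply reseeds from the estimate.
        fakeEstimate.set(250);
        numKeys.reseed();
        System.out.println(numKeys.get()); // 250
    }
}
```

Because every OM reseeds from its own database rather than from a separately flushed file, there is no second persisted copy of the metric to repair or to drift out of date.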
Issue Links
- relates to
  - HDDS-1829 On OM reload/restart OmMetrics#numKeys should be updated (Resolved)
  - HDDS-10094 OM start failure due to failure to parse OM metrics file. (Open)
  - HDDS-816 Create OM metrics for bucket, volume, keys (Resolved)
- supersedes
  - HDDS-10065 Create repair tool for OM metrics (Resolved)
- links to