Apache Ozone / HDDS-9168 Improve Ozone metrics / HDDS-10110

Use RocksDB key count estimates instead of OM metrics file


Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: OM

    Description

      HDDS-816 added a JSON file in the OM to persist metrics such as key count. A doc attached to that Jira compares some options and concludes that periodically flushing to a JSON file is the best approach. However, it overlooks several problems with saving metrics this way:

      • Error handling was missed; see HDDS-10094.
      • OMs' metrics can diverge if OMs are restarted at different times between flushes of the file.
      • When a snapshot is installed on a follower, the metric is reset to the estimated row count anyway, so that follower's metrics will have diverged from those of the other OMs.
      • When metrics for various OMs diverge, they will show different lines in dashboarding applications like Grafana, which may be confusing for users.
      • Restoring the metric to a correct value after bugs like HDDS-10063 requires some sort of manual repair.
      • Once metrics diverge between OMs, even a restart will not bring them back in sync.
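
      The divergence described above can be reproduced with a toy model. This is only an illustrative sketch, not Ozone code: FlushedCounter and its methods are hypothetical names, and the "file" is just a field holding the last flushed value.

```java
/** Toy model of a metric persisted by periodic flushes; not Ozone code. */
final class FlushedCounter {
  private long inMemory;   // value the OM reports to dashboards
  private long persisted;  // value last written to the (hypothetical) metrics file

  void apply() { inMemory++; }              // one key-creating request
  void flush() { persisted = inMemory; }    // periodic flush to the file
  void restart() { inMemory = persisted; }  // on restart, reload from the file
  long get() { return inMemory; }

  /** Replays one divergence scenario and returns {om1, om2}. */
  static long[] demo() {
    FlushedCounter om1 = new FlushedCounter();
    FlushedCounter om2 = new FlushedCounter();
    for (int i = 0; i < 6; i++) { om1.apply(); om2.apply(); }
    om1.flush(); om2.flush();               // both files now hold 6
    for (int i = 0; i < 4; i++) { om1.apply(); om2.apply(); }
    om1.restart();                          // om1 falls back to 6, om2 stays at 10
    for (int i = 0; i < 5; i++) { om1.apply(); om2.apply(); }
    om1.flush(); om2.flush();
    om1.restart(); om2.restart();           // restarting both does not re-sync them
    return new long[] {om1.get(), om2.get()};
  }
}
```

      In this scenario both OMs applied the same 15 requests, yet the first OM lost the 4 increments made between its last flush and its restart, and nothing in the flush-and-reload cycle ever recovers them.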

      HDDS-1829 later added the ability for some metrics to be initialized from RocksDB key count estimates. See "How to know the number of keys stored in a RocksDB database?" in the RocksDB FAQ. These metrics survive a restart by using the key count estimate and do not use the metrics JSON file, so we now have two divergent implementations. However, once these metrics are set on startup, they are not incremented as new OM operations come in.

      This Jira proposes:

      1. Get rid of the OM metrics JSON file.
      2. Use key count estimates for all metrics that must survive a restart.
      3. Continue to update these metrics as OM requests come in.
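
      A minimal sketch of how points 2 and 3 might fit together, under stated assumptions: EstimateBackedCounter is a hypothetical class, not existing Ozone code, and the LongSupplier stands in for whatever lookup returns the RocksDB estimate (for example, the rocksdb.estimate-num-keys property of the relevant table).

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical counter that is seeded from a key count estimate and then
 * kept current as requests are applied. No metrics file is involved.
 */
final class EstimateBackedCounter {
  private final LongSupplier estimate;  // assumed source of the RocksDB estimate
  private final AtomicLong value = new AtomicLong();

  EstimateBackedCounter(LongSupplier estimate) {
    this.estimate = estimate;
    resync();  // seed on startup instead of reading a persisted file
  }

  /** Re-seed from the estimate, e.g. on startup or after a snapshot install. */
  void resync() {
    value.set(Math.max(0L, estimate.getAsLong()));
  }

  /** Called when a request adds a key. */
  void increment() {
    value.incrementAndGet();
  }

  /** Called when a request removes a key; clamps at zero because the seed is only an estimate. */
  void decrement() {
    value.updateAndGet(v -> v > 0 ? v - 1 : 0);
  }

  long get() {
    return value.get();
  }
}
```

      With this shape, a restart or a snapshot install calls resync() on that OM, so each OM starts again from its own database's estimate rather than from a possibly stale file.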

      While the RocksDB estimated key count will not be totally accurate, the JSON-based approach is not accurate either. The RocksDB approach is easier to maintain, both in the amount of code required and in the effort needed to fix metric counting bugs.

            People

              Assignee: Ethan Rose (erose)
              Reporter: Ethan Rose (erose)