  1. Solr
  2. SOLR-15059

Default Grafana dashboard needs to expose graphs for monitoring query performance


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.8, 9.0
    • Component/s: Grafana Dashboard, metrics
    • Labels: None

    Description

      The default Grafana dashboard doesn't expose graphs for monitoring query performance. For instance, if I want to see QPS for a collection, that's not shown in the default dashboard. The same goes for latency quantiles such as p95.

      After some digging, I found that these metrics are available in the output of /admin/metrics but are not exported by the exporter.
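
To illustrate where these values live, here is a minimal Python sketch that walks a hand-written sample shaped like the /admin/metrics response and picks out the distributed request timers per core. The registry name, handler key, and numbers are invented for illustration, and the helper `select_timers` is hypothetical; only the general layout (a `metrics` object keyed by registry, each holding timer objects with fields such as `1minRate` and `p95_ms`) is assumed.

```python
# Hypothetical, trimmed-down sample of /admin/metrics output (values invented).
SAMPLE = {
    "metrics": {
        "solr.core.demo.shard1.replica_n1": {
            "QUERY./select.distrib.requestTimes": {"count": 120, "1minRate": 2.5, "p95_ms": 41.0},
            "QUERY./select.local.requestTimes": {"count": 300, "1minRate": 6.0, "p95_ms": 12.0},
        }
    }
}


def select_timers(response, key_suffix, field):
    """Return {(registry, key): value} for core registries whose metric key ends with key_suffix."""
    found = {}
    for registry, metrics in response["metrics"].items():
        if not registry.startswith("solr.core."):
            continue  # only per-core registries are of interest here
        for key, obj in metrics.items():
            if key.endswith(key_suffix) and field in obj:
                found[(registry, key)] = obj[field]
    return found


distrib_p95 = select_timers(SAMPLE, ".distrib.requestTimes", "p95_ms")
local_rate = select_timers(SAMPLE, ".local.requestTimes", "1minRate")
print(distrib_p95)
print(local_rate)
```

The suffix match plays the same role as the `endswith(...)` key selectors used in the exporter config: it separates the distributed (top-level) timers from the local (non-distrib) ones.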

      This PR proposes to enhance the default dashboard with a new Query Metrics section with the following metrics:

      • Distributed QPS per Collection (aggregated across all cores)
      • Distributed QPS per Solr Node (aggregated across all base_url)
      • QPS 1-min rate per core
      • QPS 5-min rate per core
      • Top-level Query latency p99, p95, p75
      • Local (non-distrib) query count per core (this is important for determining if there is unbalanced load)
      • Local (non-distrib) query rate per core (1-min)
      • Local (non-distrib) p95 per core
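
The "aggregated across" wording in the list above amounts to summing per-core rates grouped by a label. The following is a rough Python sketch of that grouping, assuming each exported sample carries `collection` and `base_url` labels; the sample values, hostnames, and the helper `sum_by` are made up for illustration, and in the dashboard itself this grouping would be done by the panel query rather than by code like this.

```python
from collections import defaultdict

# Hypothetical per-core samples: (collection, base_url, core, 1-minute query rate).
SAMPLES = [
    ("demo", "http://solr-0:8983/solr", "demo_shard1_replica_n1", 2.0),
    ("demo", "http://solr-1:8983/solr", "demo_shard2_replica_n2", 3.0),
    ("logs", "http://solr-0:8983/solr", "logs_shard1_replica_n1", 1.5),
]


def sum_by(samples, label_index):
    """Sum the rate (last tuple element) grouped by the label at label_index."""
    totals = defaultdict(float)
    for sample in samples:
        totals[sample[label_index]] += sample[3]
    return dict(totals)


qps_per_collection = sum_by(SAMPLES, 0)  # aggregated across all cores
qps_per_node = sum_by(SAMPLES, 1)        # aggregated across all base_url values
print(qps_per_collection)
print(qps_per_node)
```

Comparing the per-core local counts before aggregation is what reveals unbalanced load: a core whose local count grows much faster than its peers is receiving a disproportionate share of the sub-requests.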

      Also, solr-exporter-config.xml uses jq queries to pull metrics from the output of /admin/metrics. This file is huge and contains a lot of jq boilerplate, and the additional 15-20 metrics I'm introducing in this PR would only make it more verbose.

      Thus, I'm also introducing support for jq templates to reduce boilerplate, reduce syntax errors, and improve readability. For instance, the query metrics I'm adding to the config look like this:

                <str>
                  $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
                </str>
                <str>
                  $jq:core-query(5minRate, endswith(".distrib.requestTimes"))
                </str>
      

      This avoids duplicating the complicated jq query for each metric. The templates are optional and should only be used if a given jq structure is repeated 3 or more times; otherwise, inlining the jq query is still supported. Here's how the templates work:

        A regex with named groups is used to match template references to template + vars using the basic pattern:
      
            $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> )
      
        For instance,
      
            $jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)
      
        TEMPLATE = core
        UNIQUE = requests_total (unique suffix for this metric, results in a metric named "solr_metrics_core_requests_total")
        KEYSELECTOR = endswith(".requestTimes") (filter to select the specific key for this metric)
        METRIC = count
        TYPE = COUNTER
      
        Some templates may have a default type, so you can omit that from your template reference, such as:
      
            $jq:core(requests_total, endswith(".requestTimes"), count)
      
        This uses defaultType=COUNTER, as many uses of the core template are counts.
      
        If a template reference omits the metric, then the unique suffix is used as the metric name, for instance:
      
            $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
      
        Creates a GAUGE metric (default type) named "solr_metrics_core_query_1minRate" using the 1minRate value from the selected JSON object.
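
The matching and defaulting rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the exporter's actual code: the regex, the `DEFAULT_TYPES` table, and the `parse_reference` helper are hypothetical, and the metric-name rule (`solr_metrics_` plus the template name with `-` replaced by `_`, plus the unique suffix) is inferred from the two names quoted in the description.

```python
import re

# Illustrative pattern with named groups; not the exporter's actual regex.
TEMPLATE_REF = re.compile(
    r"""
    \$jq:(?P<template>[\w-]+)\(      # template name, e.g. core or core-query
    \s*(?P<unique>[^,\s]+)\s*,       # unique metric suffix
    \s*(?P<keyselector>.+\))         # jq key filter; assumed to end with ')'
    (?:\s*,\s*(?P<metric>\w+))?      # optional metric field
    (?:\s*,\s*(?P<type>[A-Z]+))?     # optional metric type
    \s*\)\s*$
    """,
    re.VERBOSE,
)

# Default types as stated in the description for these two templates.
DEFAULT_TYPES = {"core": "COUNTER", "core-query": "GAUGE"}


def parse_reference(text):
    """Split a template reference into its variables and apply the defaults."""
    match = TEMPLATE_REF.match(text.strip())
    if match is None:
        raise ValueError(f"not a template reference: {text!r}")
    ref = match.groupdict()
    ref["metric"] = ref["metric"] or ref["unique"]  # omitted metric -> unique suffix
    ref["type"] = ref["type"] or DEFAULT_TYPES[ref["template"]]
    # Naming rule inferred from the examples in the description (assumption).
    ref["name"] = "solr_metrics_" + ref["template"].replace("-", "_") + "_" + ref["unique"]
    return ref


full = parse_reference('$jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)')
short = parse_reference('$jq:core-query(1minRate, endswith(".distrib.requestTimes"))')
print(full["name"], full["metric"], full["type"])
print(short["name"], short["metric"], short["type"])
```

Because the key selector is itself a jq expression containing parentheses, the sketch relies on regex backtracking to find where the selector ends; a real implementation might prefer a stricter grammar.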
      

      So that people don't have to dig through the large diff on the config XML, here are the query metrics I'm adding to the exporter config using the templates:

                <str>
                  $jq:core-query(errors_1minRate, select(.key | endswith(".errors")), 1minRate)
                </str>
                <str>
                  $jq:core-query(client_errors_1minRate, select(.key | endswith(".clientErrors")), 1minRate)
                </str>
                <str>
                  $jq:core-query(1minRate, select(.key | endswith(".distrib.requestTimes")), 1minRate)
                </str>
                <str>
                  $jq:core-query(5minRate, select(.key | endswith(".distrib.requestTimes")), 5minRate)
                </str>
                <str>
                  $jq:core-query(median_ms, select(.key | endswith(".distrib.requestTimes")), median_ms)
                </str>
                <str>
                  $jq:core-query(p75_ms, select(.key | endswith(".distrib.requestTimes")), p75_ms)
                </str>
                <str>
                  $jq:core-query(p95_ms, select(.key | endswith(".distrib.requestTimes")), p95_ms)
                </str>
                <str>
                  $jq:core-query(p99_ms, select(.key | endswith(".distrib.requestTimes")), p99_ms)
                </str>
                <str>
                  $jq:core-query(mean_rate, select(.key | endswith(".distrib.requestTimes")), meanRate)
                </str>
                
                <!-- Local (non-distrib) query metrics -->
                <str>
                  $jq:core-query(local_1minRate, select(.key | endswith(".local.requestTimes")), 1minRate)
                </str>
                <str>
                  $jq:core-query(local_5minRate, select(.key | endswith(".local.requestTimes")), 5minRate)
                </str>
                <str>
                  $jq:core-query(local_median_ms, select(.key | endswith(".local.requestTimes")), median_ms)
                </str>
                <str>
                  $jq:core-query(local_p75_ms, select(.key | endswith(".local.requestTimes")), p75_ms)
                </str>
                <str>
                  $jq:core-query(local_p95_ms, select(.key | endswith(".local.requestTimes")), p95_ms)
                </str>
                <str>
                  $jq:core-query(local_p99_ms, select(.key | endswith(".local.requestTimes")), p99_ms)
                </str>
                <str>
                  $jq:core-query(local_mean_rate, select(.key | endswith(".local.requestTimes")), meanRate)
                </str>
                <str>
                  $jq:core-query(local_count, select(.key | endswith(".local.requestTimes")), count, COUNTER)
                </str>
      

          People

            thelabdude Timothy Potter

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 3h 10m