Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10226

JMX metric avgTimePerRequest broken

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.4.1
    • Fix Version/s: 6.5, 7.0
    • Component/s: metrics
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      JMX Metric avgTimePerRequest (of org.apache.solr.handler.component.SearchHandler) doesn't appear to behave correctly anymore. It was a cumulative value in pre-6.4 versions. Since totalTime metric was removed (which was a base for monitoring calculations), avgTimePerRequest seems like possible alternative to calculate "time spent in requests since last measurement", but it behaves strangely after 6.4.

      I did a simple test on gettingstarted collection (just unpacked the Solr 6.4.1 version and started it with "bin/solr start -e cloud -noprompt"). The query I used was:
      http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json
      I run it 30 times in a row (with approx 1 sec between executions).

      At the same time I was looking (with jconsole) at bean solr/gettingstarted_shard2_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler

      Here is how metric was changing over time (first number is "requests" metric, second number is "avgTimePerRequest"):
      10 6.6033
      12 5.9557
      13 0.9015 ---> 13th req would need negative duration if this was cumulative
      15 6.7315
      16 7.4873
      17 0.8458 ---> same case with 17th request
      23 6.1076

      At the same time bean solr/gettingstarted_shard1_replica2:type=/select,id=org.apache.solr.handler.component.SearchHandler also showed strange values:
      6 5.13482
      8 10.5694
      9 0.504
      10 0.344
      12 8.8121
      18 3.3531

      CC Andrzej Bialecki

      1. SOLR-10226.patch
        2 kB
        Andrzej Bialecki
      2. SOLR-10226.patch
        2 kB
        Andrzej Bialecki

        Activity

        Hide
        ab Andrzej Bialecki added a comment -

        At what intervals did you check these numbers? Previous versions of Solr used a very old version of the code, which supposedly used a similar exponentially decaying sampling (with a strong bias towards the most recent 5 min window). The new code uses an exponentially decaying reservoir sampling with the same bias so theoretically results should be the same...

        So, both in the new and in the old code these values would be exponentially decaying over time, so if you don't make any requests for a while the rates will fall down.

        If you could please capture the timestamps at which you made the requests and observations it would be very helpful - thank you!

        Show
        ab Andrzej Bialecki added a comment - At what intervals did you check these numbers? Previous versions of Solr used a very old version of the code, which supposedly used a similar exponentially decaying sampling (with a strong bias towards the most recent 5 min window). The new code uses an exponentially decaying reservoir sampling with the same bias so theoretically results should be the same... So, both in the new and in the old code these values would be exponentially decaying over time, so if you don't make any requests for a while the rates will fall down. If you could please capture the timestamps at which you made the requests and observations it would be very helpful - thank you!
        Hide
        bosmid Bojan Smid added a comment -

        I think avgTimePerRequest in previous versions didn't have decay/sampling applied on it. I am looking at one Solr 6.3 node which was up for the past 2 months (I checked few other setups, different Solr version, but saw the same behavior). Here are the stats from its standard handler:
        requests:1791464
        totalTime:564718.746333
        avgTimePerRequest:0.3152275157820643

        Both requests and totalTime metrics are cumulative and avgTimePerRequest shows exactly value totalTime/requests, therefore there was no decay/sampling applied in calculation of avgTime before 6.4.

        When it comes to previously posted sample, there was something like 30-60 sec or so between the requests (the time I needed to write down the numbers). I just did another test, fresh values (this time just 3-5 sec between the requests):
        1 85.3
        2 41.2
        3 26.1
        4 17.0
        6 11.08
        7 7.43
        8 4.98
        9 3.62
        11 3.28
        (few min pause)
        13 8.12
        14 3.33
        (few min pause)
        15 9.69
        16 4.09

        Does decay/sampling explain the behavior even with these short periods between the requests (ranging from few sec to few min)?

        Show
        bosmid Bojan Smid added a comment - I think avgTimePerRequest in previous versions didn't have decay/sampling applied on it. I am looking at one Solr 6.3 node which was up for the past 2 months (I checked few other setups, different Solr version, but saw the same behavior). Here are the stats from its standard handler: requests:1791464 totalTime:564718.746333 avgTimePerRequest:0.3152275157820643 Both requests and totalTime metrics are cumulative and avgTimePerRequest shows exactly value totalTime/requests, therefore there was no decay/sampling applied in calculation of avgTime before 6.4. When it comes to previously posted sample, there was something like 30-60 sec or so between the requests (the time I needed to write down the numbers). I just did another test, fresh values (this time just 3-5 sec between the requests): 1 85.3 2 41.2 3 26.1 4 17.0 6 11.08 7 7.43 8 4.98 9 3.62 11 3.28 (few min pause) 13 8.12 14 3.33 (few min pause) 15 9.69 16 4.09 Does decay/sampling explain the behavior even with these short periods between the requests (ranging from few sec to few min)?
        Hide
        ab Andrzej Bialecki added a comment -

        You're right - the Timer implementation that was used in Solr 6.3 and earlier internally used Histogram, which did not apply decaying to the total accumulated value. When we upgraded this class to a newer version from Codahale Metrics the underlying new implementation of Histogram does apply decaying to this value...

        Anyway, we have to add back a simple counter to track the total value as "totalTime", which somehow disappeared for no good reason. From that you will be able again to calculate the non-decaying average time.

        The question is what to do with avgTimePerRequest. In my opinion, moving forward we should keep the decaying avgTimePerRequest because it more correctly represents the recent state of the system as opposed to the cumulative non-decayed value, which doesn't really reflect anything in particular (there could have been extended periods of idle time followed by recent high activity, and the value would be still low even though the recent load was high).

        CC Otis Gospodnetic.

        Show
        ab Andrzej Bialecki added a comment - You're right - the Timer implementation that was used in Solr 6.3 and earlier internally used Histogram , which did not apply decaying to the total accumulated value. When we upgraded this class to a newer version from Codahale Metrics the underlying new implementation of Histogram does apply decaying to this value... Anyway, we have to add back a simple counter to track the total value as "totalTime", which somehow disappeared for no good reason. From that you will be able again to calculate the non-decaying average time. The question is what to do with avgTimePerRequest. In my opinion, moving forward we should keep the decaying avgTimePerRequest because it more correctly represents the recent state of the system as opposed to the cumulative non-decayed value, which doesn't really reflect anything in particular (there could have been extended periods of idle time followed by recent high activity, and the value would be still low even though the recent load was high). CC Otis Gospodnetic .
        Hide
        ab Andrzej Bialecki added a comment -

        Patch adding back non-decayed "totalTime".

        Show
        ab Andrzej Bialecki added a comment - Patch adding back non-decayed "totalTime".
        Hide
        bosmid Bojan Smid added a comment -

        Thanks for looking into this and patching it so quickly .

        From what I see, "totalTime" was removed in SOLR-8785. Having it back solves my problem (actually, any monitoring solution would need such cumulative total time). Re avgTimePerRequest - I agree with what you suggest, decayed value makes much more sense (non-decayed would only be useful as a hack to get to totalTime).

        Show
        bosmid Bojan Smid added a comment - Thanks for looking into this and patching it so quickly . From what I see, "totalTime" was removed in SOLR-8785 . Having it back solves my problem (actually, any monitoring solution would need such cumulative total time). Re avgTimePerRequest - I agree with what you suggest, decayed value makes much more sense (non-decayed would only be useful as a hack to get to totalTime).
        Hide
        bosmid Bojan Smid added a comment -

        I tested the patch quickly, metric totalTime is now there, but there is one small problem - it is expressed in ns. To be backward compatible it should be in ms.

        Show
        bosmid Bojan Smid added a comment - I tested the patch quickly, metric totalTime is now there, but there is one small problem - it is expressed in ns. To be backward compatible it should be in ms.
        Hide
        ab Andrzej Bialecki added a comment -

        Fixed totalTime conversion so that it's expressed in ms. If there are no objections I'll commit this shortly.

        Show
        ab Andrzej Bialecki added a comment - Fixed totalTime conversion so that it's expressed in ms. If there are no objections I'll commit this shortly.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 2d51a42d3cae3eddc89f407cd3611fa2cd5d55d0 in lucene-solr's branch refs/heads/master from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d51a42 ]

        SOLR-10226 JMX metric avgTimePerRequest broken.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 2d51a42d3cae3eddc89f407cd3611fa2cd5d55d0 in lucene-solr's branch refs/heads/master from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d51a42 ] SOLR-10226 JMX metric avgTimePerRequest broken.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 242ddfb148eda45347ff34d2b16958a835c340a5 in lucene-solr's branch refs/heads/branch_6x from Andrzej Bialecki
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=242ddfb ]

        SOLR-10226 JMX metric avgTimePerRequest broken.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 242ddfb148eda45347ff34d2b16958a835c340a5 in lucene-solr's branch refs/heads/branch_6x from Andrzej Bialecki [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=242ddfb ] SOLR-10226 JMX metric avgTimePerRequest broken.

          People

          • Assignee:
            ab Andrzej Bialecki
            Reporter:
            bosmid Bojan Smid
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development