Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Right now there is little to no information exposed about the overseer from SolrCloud.

      I propose that we have an API for overseer status which can return:

      1. Past N commands executed (grouped by command type)
      2. Status (queue-size, current overseer leader node)
      3. Overseer log

      Attachments

      1. SOLR-5749.patch (44 kB, Shalin Shekhar Mangar)
      2. SOLR-5749.patch (45 kB, Shalin Shekhar Mangar)
      3. SOLR-5749.patch (41 kB, Shalin Shekhar Mangar)
      4. SOLR-5749.patch (35 kB, Shalin Shekhar Mangar)
      5. SOLR-5749.patch (18 kB, Shalin Shekhar Mangar)
      6. SOLR-5749.patch (18 kB, Shalin Shekhar Mangar)

          Activity

          Mark Miller added a comment -

          +1

          Timothy Potter added a comment -

          Would it make sense to also include metrics for the overseer (possibly using the codahale stuff described in SOLR-4735)? These metrics could then support the decision to apply SOLR-5746 (dedicated/higher-end node for overseer). Key metrics I'd be interested in tracking are:

          1) metrics about how many messages per second / minute the overseer is processing
          2) metrics about how long it takes to process each type of operation (leader, state, etc.)
          3) metrics around queue sizes / activity (e.g. workQueue has 1000 messages pending)

          Shalin Shekhar Mangar added a comment -

          This adds the /admin/collections?action=OVERSEERSTATUS API. The stats added are:

          1. success and error counts
          2. queue sizes for overseer, overseer work queue and overseer collection queue
          3. various timing statistics per operation type

          I'm still working on the tests.
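
As a rough illustration of how a client might address the new action, here is a minimal sketch that only builds the request URL. The base URL is taken from the example later in this thread; the `wt` response-format parameter and the helper name `overseer_status_url` are assumptions made for illustration, not part of the patch.

```python
from urllib.parse import urlencode


def overseer_status_url(base_url="http://localhost:8983/solr", wt="json"):
    """Build the URL for the OVERSEERSTATUS action of the Collections API.

    base_url and wt are illustrative defaults, not values mandated by the patch.
    """
    query = urlencode({"action": "OVERSEERSTATUS", "wt": wt})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


url = overseer_status_url()
print(url)
```

The resulting URL can be passed to any HTTP client; the sketch deliberately stops short of making a network call.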

          Mark Miller added a comment -

          Nice Shalin!

          Shalin Shekhar Mangar added a comment -

          Here's how it looks right now:
          http://localhost:8983/solr/admin/collections?action=overseerstatus

          <?xml version="1.0" encoding="UTF-8"?>
          <response>
            <lst name="responseHeader">
              <int name="status">0</int>
              <int name="QTime">26</int>
            </lst>
            <str name="leader">192.168.1.3:8983_solr</str>
            <int name="overseer_queue_size">0</int>
            <int name="overseer_work_queue_size">0</int>
            <int name="overseer_collection_queue_size">2</int>
            <lst name="stats">
              <lst name="leader">
                <int name="requests">4</int>
                <int name="errors">0</int>
                <double name="totalTime">0.599</double>
                <double name="avgRequestsPerSecond">0.07359325662045857</double>
                <double name="5minRateReqsPerSecond">0.3504682187309409</double>
                <double name="15minRateReqsPerSecond">0.38265912794758644</double>
                <double name="avgTimePerRequest">0.14975</double>
                <double name="medianRequestTime">0.1395</double>
                <double name="75thPcRequestTime">0.179</double>
                <double name="95thPcRequestTime">0.19</double>
                <double name="99thPcRequestTime">0.19</double>
                <double name="999thPcRequestTime">0.19</double>
              </lst>
              <lst name="state">
                <int name="requests">4</int>
                <int name="errors">0</int>
                <double name="totalTime">8.589</double>
                <double name="avgRequestsPerSecond">0.06929964428146092</double>
                <double name="5minRateReqsPerSecond">0.3504682187309409</double>
                <double name="15minRateReqsPerSecond">0.38265912794758644</double>
                <double name="avgTimePerRequest">2.14725</double>
                <double name="medianRequestTime">0.8644999999999999</double>
                <double name="75thPcRequestTime">5.18075</double>
                <double name="95thPcRequestTime">6.531</double>
                <double name="99thPcRequestTime">6.531</double>
                <double name="999thPcRequestTime">6.531</double>
              </lst>
            </lst>
          </response>
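
A response in this XML shape could be read on the client side roughly as follows. This is a sketch, not part of the patch: the sample is a trimmed copy of the response above, and the helper `parse_node` is a hypothetical name that only handles the `lst`, `int`, `double` and `str` elements seen here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above (only a few fields kept).
SAMPLE = """
<response>
  <str name="leader">192.168.1.3:8983_solr</str>
  <int name="overseer_queue_size">0</int>
  <int name="overseer_work_queue_size">0</int>
  <int name="overseer_collection_queue_size">2</int>
  <lst name="stats">
    <lst name="leader">
      <int name="requests">4</int>
      <int name="errors">0</int>
      <double name="avgTimePerRequest">0.14975</double>
    </lst>
    <lst name="state">
      <int name="requests">4</int>
      <int name="errors">0</int>
      <double name="avgTimePerRequest">2.14725</double>
    </lst>
  </lst>
</response>
"""

# Map element tags to Python types; unknown tags fall back to str.
CASTS = {"int": int, "double": float, "str": str}


def parse_node(node):
    """Convert a Solr XML element into a plain Python value."""
    if node.tag in ("lst", "response"):
        return {child.get("name"): parse_node(child) for child in node}
    return CASTS.get(node.tag, str)(node.text or "")


status = parse_node(ET.fromstring(SAMPLE))
print(status["leader"], status["overseer_collection_queue_size"])
print(status["stats"]["state"]["avgTimePerRequest"])
```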
          
          Shalin Shekhar Mangar added a comment -

          Thanks Mark.

          Timothy Potter - I didn't use the metrics APIs (that's a big issue!) but you'll find that all of your demands are met by this patch.

          I think we should rename "stats" to "operations" and track the stats per minute instead of per second, since Overseer operations are not that frequent. I am working on adding the past N operations and the past N failures (exceptions) per operation to the stats. Right now the stats are held in memory, which means that we lose them if the overseer dies. I think that we should periodically, say every 15 minutes, save the stats to ZK and initialize the stats from ZK when a new Overseer starts.

          Shalin Shekhar Mangar added a comment -
          • Track stats per-minute instead of per-second
          • Added collection processor statistics as well
          Noble Paul added a comment -

          We should probably also add statistics on the time taken to read the messages from ZK.

          Shalin Shekhar Mangar added a comment -

          Added statistics on DistributedQueue as well.

          {
            "responseHeader":{
              "status":0,
              "QTime":28},
            "leader":"127.0.1.1:8983_solr",
            "overseer_queue_size":0,
            "overseer_work_queue_size":0,
            "overseer_collection_queue_size":2,
            "overseer_operations":[
              "createcollection",{
                "requests":1,
                "errors":0,
                "totalTime":0.971481,
                "avgRequestsPerMinute":0.4506035128995973,
                "5minRateReqsPerMinute":7.910887562405325,
                "15minRateReqsPerMinute":10.443896710000685,
                "avgTimePerRequest":0.971481,
                "medianRequestTime":0.971481,
                "75thPcRequestTime":0.971481,
                "95thPcRequestTime":0.971481,
                "99thPcRequestTime":0.971481,
                "999thPcRequestTime":0.971481},
              "removeshard",{
                "requests":2,
                "errors":0,
                "totalTime":0.528514,
                "avgRequestsPerMinute":1.3925947936037932,
                "5minRateReqsPerMinute":9.662379888216847,
                "15minRateReqsPerMinute":11.16388960970474,
                "avgTimePerRequest":0.264257,
                "medianRequestTime":0.26425699999999996,
                "75thPcRequestTime":0.273016,
                "95thPcRequestTime":0.273016,
                "99thPcRequestTime":0.273016,
                "999thPcRequestTime":0.273016},
              "updateshardstate",{
                "requests":3,
                "errors":0,
                "totalTime":2.400814,
                "avgRequestsPerMinute":1.585204122426084,
                "5minRateReqsPerMinute":8.747699997919707,
                "15minRateReqsPerMinute":10.798558466515363,
                "avgTimePerRequest":0.8002713333333333,
                "medianRequestTime":0.653465,
                "75thPcRequestTime":1.194389,
                "95thPcRequestTime":1.194389,
                "99thPcRequestTime":1.194389,
                "999thPcRequestTime":1.194389},
              "state",{
                "requests":20,
                "errors":0,
                "totalTime":24.947245,
                "avgRequestsPerMinute":7.125324255165824,
                "5minRateReqsPerMinute":17.05318606289356,
                "15minRateReqsPerMinute":21.377358367732,
                "avgTimePerRequest":1.24736225,
                "medianRequestTime":0.6480015,
                "75thPcRequestTime":0.77874075,
                "95thPcRequestTime":10.828164299999994,
                "99thPcRequestTime":11.242307,
                "999thPcRequestTime":11.242307},
              "createshard",{
                "requests":6,
                "errors":0,
                "totalTime":0.6182,
                "avgRequestsPerMinute":3.114875928429097,
                "5minRateReqsPerMinute":17.4905427262541,
                "15minRateReqsPerMinute":21.596450394663627,
                "avgTimePerRequest":0.10303333333333332,
                "medianRequestTime":0.1074325,
                "75thPcRequestTime":0.128297,
                "95thPcRequestTime":0.155576,
                "99thPcRequestTime":0.155576,
                "999thPcRequestTime":0.155576},
              "leader",{
                "requests":20,
                "errors":0,
                "totalTime":2.068689,
                "avgRequestsPerMinute":7.222057819167628,
                "5minRateReqsPerMinute":17.568357153570165,
                "15minRateReqsPerMinute":21.607558446226854,
                "avgTimePerRequest":0.10343445,
                "medianRequestTime":0.105716,
                "75thPcRequestTime":0.125223,
                "95thPcRequestTime":0.15367004999999997,
                "99thPcRequestTime":0.154579,
                "999thPcRequestTime":0.154579},
              "deletecore",{
                "requests":2,
                "errors":0,
                "totalTime":0.108824,
                "avgRequestsPerMinute":1.3923955098922307,
                "5minRateReqsPerMinute":9.662379888216847,
                "15minRateReqsPerMinute":11.16388960970474,
                "avgTimePerRequest":0.054412,
                "medianRequestTime":0.054412,
                "75thPcRequestTime":0.058704,
                "95thPcRequestTime":0.058704,
                "99thPcRequestTime":0.058704,
                "999thPcRequestTime":0.058704}],
            "collection_operations":[
              "overseerstatus",{
                "requests":3,
                "errors":0,
                "totalTime":14.146347,
                "avgRequestsPerMinute":1.2874214823830716,
                "5minRateReqsPerMinute":8.124718170996797,
                "15minRateReqsPerMinute":10.512878920945859,
                "avgTimePerRequest":4.715449,
                "medianRequestTime":4.757046,
                "75thPcRequestTime":4.925945,
                "95thPcRequestTime":4.925945,
                "99thPcRequestTime":4.925945,
                "999thPcRequestTime":4.925945},
              "splitshard",{
                "requests":3,
                "errors":0,
                "totalTime":6937.350684,
                "avgRequestsPerMinute":1.5573643647163575,
                "5minRateReqsPerMinute":8.747699997919707,
                "15minRateReqsPerMinute":10.798558466515363,
                "avgTimePerRequest":2312.450228,
                "medianRequestTime":2027.78933,
                "75thPcRequestTime":2935.911503,
                "95thPcRequestTime":2935.911503,
                "99thPcRequestTime":2935.911503,
                "999thPcRequestTime":2935.911503},
              "createcollection",{
                "requests":1,
                "errors":0,
                "totalTime":2236.185807,
                "avgRequestsPerMinute":0.45055336110011696,
                "5minRateReqsPerMinute":7.910887562405325,
                "15minRateReqsPerMinute":10.443896710000685,
                "avgTimePerRequest":2236.185807,
                "medianRequestTime":2236.185807,
                "75thPcRequestTime":2236.185807,
                "95thPcRequestTime":2236.185807,
                "99thPcRequestTime":2236.185807,
                "999thPcRequestTime":2236.185807},
              "deleteshard",{
                "requests":2,
                "errors":1,
                "totalTime":449.782332,
                "avgRequestsPerMinute":2.088013136687043,
                "5minRateReqsPerMinute":9.6597401952259,
                "15minRateReqsPerMinute":11.163546953268966,
                "avgTimePerRequest":149.927444,
                "medianRequestTime":223.62014,
                "75thPcRequestTime":224.781676,
                "95thPcRequestTime":224.781676,
                "99thPcRequestTime":224.781676,
                "999thPcRequestTime":224.781676}],
            "overseer_queue":[
              "peek_wait100",{
                "totalTime":2373.782195,
                "avgRequestsPerMinute":14.806326086807369,
                "5minRateReqsPerMinute":20.993580169733175,
                "15minRateReqsPerMinute":22.99699762014067,
                "avgTimePerRequest":57.89712670731708,
                "medianRequestTime":100.719081,
                "75thPcRequestTime":101.3016165,
                "95thPcRequestTime":101.80825329999999,
                "99thPcRequestTime":101.903817,
                "999thPcRequestTime":101.903817},
              "peek_wait_forever",{
                "totalTime":88855.760833,
                "avgRequestsPerMinute":12.093958880705209,
                "5minRateReqsPerMinute":34.44753369023466,
                "15minRateReqsPerMinute":42.880536002203655,
                "avgTimePerRequest":2613.4047303823527,
                "medianRequestTime":87.3240445,
                "75thPcRequestTime":2211.790657,
                "95thPcRequestTime":17847.67253375,
                "99thPcRequestTime":25305.997634,
                "999thPcRequestTime":25305.997634},
              "remove",{
                "totalTime":83.723085,
                "avgRequestsPerMinute":19.238670321778937,
                "5minRateReqsPerMinute":37.37089757629722,
                "15minRateReqsPerMinute":44.081383699639126,
                "avgTimePerRequest":1.5504275,
                "medianRequestTime":1.424577,
                "75thPcRequestTime":1.7675002499999999,
                "95thPcRequestTime":3.0253507500000003,
                "99thPcRequestTime":4.435305,
                "999thPcRequestTime":4.435305},
              "poll",{
                "totalTime":85.616811,
                "avgRequestsPerMinute":19.238605533763508,
                "5minRateReqsPerMinute":37.37089757629722,
                "15minRateReqsPerMinute":44.081383699639126,
                "avgTimePerRequest":1.5854965,
                "medianRequestTime":1.448463,
                "75thPcRequestTime":1.815407,
                "95thPcRequestTime":3.08085125,
                "99thPcRequestTime":4.510461,
                "999thPcRequestTime":4.510461}],
            "overseer_internal_queue":[
              "peek",{
                "totalTime":0.734537,
                "avgRequestsPerMinute":0.35570294684187176,
                "5minRateReqsPerMinute":7.158067076339619,
                "15minRateReqsPerMinute":10.101505049683713,
                "avgTimePerRequest":0.734537,
                "medianRequestTime":0.734537,
                "75thPcRequestTime":0.734537,
                "95thPcRequestTime":0.734537,
                "99thPcRequestTime":0.734537,
                "999thPcRequestTime":0.734537},
              "offer",{
                "totalTime":43.474119,
                "avgRequestsPerMinute":19.2386904364582,
                "5minRateReqsPerMinute":37.37089757629722,
                "15minRateReqsPerMinute":44.081383699639126,
                "avgTimePerRequest":0.8050762777777778,
                "medianRequestTime":0.815022,
                "75thPcRequestTime":1.017915,
                "95thPcRequestTime":1.33674325,
                "99thPcRequestTime":1.787279,
                "999thPcRequestTime":1.787279},
              "remove",{
                "totalTime":131.244284,
                "avgRequestsPerMinute":31.35258075987943,
                "5minRateReqsPerMinute":71.81843126653187,
                "15minRateReqsPerMinute":86.9619197018428,
                "avgTimePerRequest":1.4914123181818182,
                "medianRequestTime":1.3513225,
                "75thPcRequestTime":2.18225375,
                "95thPcRequestTime":3.0428447,
                "99thPcRequestTime":3.656696,
                "999thPcRequestTime":3.656696},
              "poll",{
                "totalTime":135.212298,
                "avgRequestsPerMinute":31.352528139029836,
                "5minRateReqsPerMinute":71.81843126653187,
                "15minRateReqsPerMinute":86.9619197018428,
                "avgTimePerRequest":1.5365033863636364,
                "medianRequestTime":1.3905595000000002,
                "75thPcRequestTime":2.2290725,
                "95thPcRequestTime":3.0980611,
                "99thPcRequestTime":3.714584,
                "999thPcRequestTime":3.714584}],
            "collection_queue":[
              "remove_event",{
                "totalTime":34.544719,
                "avgRequestsPerMinute":4.2920515037757445,
                "5minRateReqsPerMinute":9.430673865735049,
                "15minRateReqsPerMinute":11.051746440368753,
                "avgTimePerRequest":3.4544718999999997,
                "medianRequestTime":3.161874,
                "75thPcRequestTime":4.8347845,
                "95thPcRequestTime":5.08054,
                "99thPcRequestTime":5.08054,
                "999thPcRequestTime":5.08054},
              "peek_wait_forever",{
                "totalTime":158944.84134,
                "avgRequestsPerMinute":3.9126903584045136,
                "5minRateReqsPerMinute":1.517316852098354,
                "15minRateReqsPerMinute":0.607514604536483,
                "avgTimePerRequest":14449.53103090909,
                "medianRequestTime":11498.972881,
                "75thPcRequestTime":27505.157498,
                "95thPcRequestTime":35014.040773,
                "99thPcRequestTime":35014.040773,
                "999thPcRequestTime":35014.040773}]}
          
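In the JSON output above, each group such as "collection_operations" is a flat array that alternates operation names and stat objects. The sketch below shows one way a client might pair them up and find operations with errors. The sample is a trimmed copy of the output above, and `pairs_to_dict` is a hypothetical helper name.

```python
import json

# Trimmed copy of the JSON above: names and stat objects alternate in one array.
SAMPLE = """{
  "overseer_queue_size": 0,
  "collection_operations": [
    "splitshard", {"requests": 3, "errors": 0, "avgTimePerRequest": 2312.450228},
    "deleteshard", {"requests": 2, "errors": 1, "avgTimePerRequest": 149.927444}
  ]
}"""


def pairs_to_dict(flat):
    """Turn ["name1", {...}, "name2", {...}] into {"name1": {...}, "name2": {...}}."""
    return dict(zip(flat[0::2], flat[1::2]))


status = json.loads(SAMPLE)
ops = pairs_to_dict(status["collection_operations"])

# List the operations that reported at least one error.
failing = [name for name, stats in ops.items() if stats["errors"] > 0]
print(failing)  # ['deleteshard']
```

Pairing by even and odd positions assumes the array always has an even length, which holds for the output shown here.
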
          Shalin Shekhar Mangar added a comment -

          Added a very basic test.

          Shalin Shekhar Mangar added a comment -

          This patch adds tracking of the 10 most recent failures (with the entire request/response) for each Collection API action. I think this, along with the requeststatus API added in SOLR-5477, removes the need to expose entire logs.

          This can be committed now. In order to write/read stats from ZK, we need to be able to serialize Timer and related classes. I shall do that via a different issue.
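
The idea of keeping only the 10 most recent failures per action can be illustrated with a bounded queue. This is a sketch of the general technique in Python, not the code in the patch; the class name `FailureLog` and the sample request/response payloads are hypothetical.

```python
from collections import defaultdict, deque

MAX_FAILURES = 10  # from the comment above: keep the 10 most recent failures


class FailureLog:
    """Keep the most recent failures per action, dropping the oldest first."""

    def __init__(self, limit=MAX_FAILURES):
        # One bounded deque per action; appending past maxlen evicts the oldest entry.
        self._by_action = defaultdict(lambda: deque(maxlen=limit))

    def record(self, action, request, response):
        self._by_action[action].append({"request": request, "response": response})

    def recent(self, action):
        return list(self._by_action[action])


log = FailureLog()
for i in range(12):
    log.record("deleteshard", {"shard": f"shard{i}"}, {"error": "hypothetical failure"})

print(len(log.recent("deleteshard")))           # 10
print(log.recent("deleteshard")[0]["request"])  # {'shard': 'shard2'}
```

Because the queue is bounded, memory use stays constant no matter how many failures occur, at the cost of losing older entries.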

          Shalin Shekhar Mangar added a comment -

          Refactored the stats class a bit - replaced the multiple maps with a single one containing a custom class.

          ASF subversion and git services added a comment -

          Commit 1580463 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1580463 ]

          SOLR-5749: A new Overseer status collection API exposes overseer queue sizes, timing statistics, success and error counts and last N failures per operation

          ASF subversion and git services added a comment -

          Commit 1580465 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1580465 ]

          SOLR-5749: Removed unused methods

          ASF subversion and git services added a comment -

          Commit 1580466 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1580466 ]

          SOLR-5749: A new Overseer status collection API exposes overseer queue sizes, timing statistics, success and error counts and last N failures per operation

          Otis Gospodnetic added a comment -

          Could this be exposed via JMX, like the rest of the stats?

          Could we improve naming a bit?
          e.g.
          + lst.add("15minRateReqsPerMinute", timer.getFifteenMinuteRate());
          + lst.add("avgTimePerRequest", timer.getMean());
          + lst.add("medianRequestTime", snapshot.getMedian());
          + lst.add("75thPcRequestTime", snapshot.get75thPercentile());

          Note:

          • Reqs vs. spelled out Request
          • Non-standard "Pc". I think people most often use "Pctl"
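To make the two naming notes concrete, here is a small sketch that applies them mechanically to the stat names quoted in this comment. It only illustrates the suggestion; the helper `StatNameSketch.rename` is hypothetical, and the names that were eventually committed are not shown in this thread.

```java
/** Hypothetical helper that applies the two naming suggestions to a stat name. */
class StatNameSketch {
    static String rename(String name) {
        return name
            .replace("Reqs", "Requests")                   // spell out the abbreviation
            .replace("PcRequestTime", "PctlRequestTime");  // "Pc" becomes "Pctl"
    }
}
```

Names that already follow both suggestions, such as `avgTimePerRequest`, pass through unchanged.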
          Shalin Shekhar Mangar added a comment - - edited

          Thanks for reviewing, Otis.

          Could this be exposed via JMX, like the rest of the stats?

          It'd probably be hard for someone to monitor it with jmx because the mbeans will be published only on the overseer node (which can change from time to time).

          The naming is actually copied over from RequestHandlerBase.getStatistics, except that we track requests per minute instead of per second. I thought we could keep them the same in both places for consistency. Would you still prefer to change them?

          Otis Gospodnetic added a comment -

          It'd probably be hard for someone to monitor it with jmx because the mbeans will be published only on the overseer node (which can change from time to time).

          I think good monitoring tools won't have a problem with that. But if you expose it through a non-standard API, then it's harder for monitoring tools to get to this info because now they need to implement a mechanism to, in addition to getting data from JMX, also get these other stats from an alternative API with a custom response format... which makes things messy.

          Re naming - I think good names and consistency are important. Applications get judged by how things are structured and named, too, not just whether they work or not or how well they work. Not seeing that consistency bugs me, but it won't break things...

          Shalin Shekhar Mangar added a comment -

          I think good monitoring tools won't have a problem with that. But if you expose it through a non-standard API, then it's harder for monitoring tools to get to this info because now they need to implement a mechanism to, in addition to getting data from JMX, also get these other stats from an alternative API with a custom response format... which makes things messy.

          That makes sense. I'll open an issue to add a jmx bean.
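For readers unfamiliar with what adding a JMX bean involves, the sketch below registers a standard MBean with the platform MBean server using only the JDK's `javax.management` API. It is not the implementation planned for Solr: the class names, the single `QueueSize` attribute and the object name `example.overseer:type=QueueStats` are assumptions chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical sketch of publishing one overseer figure as a standard MBean. */
class OverseerJmxSketch {
    /** Standard MBean interface: JMX derives the attribute "QueueSize" from the getter. */
    public interface QueueStatsMBean {
        int getQueueSize();
    }

    /** Implementation; its name must be the interface name without the "MBean" suffix. */
    public static class QueueStats implements QueueStatsMBean {
        private volatile int queueSize;

        public void setQueueSize(int queueSize) {
            this.queueSize = queueSize;
        }

        @Override
        public int getQueueSize() {
            return queueSize;
        }
    }

    private static final MBeanServer SERVER = ManagementFactory.getPlatformMBeanServer();

    /** Registers the bean under the given object name. */
    static void register(QueueStats stats, String name) {
        try {
            SERVER.registerMBean(stats, new ObjectName(name));
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads an attribute the way a monitoring tool would, through the MBean server. */
    static Object read(String name, String attribute) {
        try {
            return SERVER.getAttribute(new ObjectName(name), attribute);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Removes the bean, for example when the node stops being the overseer. */
    static void unregister(String name) {
        try {
            SERVER.unregisterMBean(new ObjectName(name));
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the overseer role can move between nodes, a design along these lines would presumably register the bean when a node becomes the overseer and unregister it when the node loses that role.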

          Applications get judged by how things are structured and named, too, not just whether they work or not or how well they work. Not seeing that consistency bugs me, but it won't break things...

          Okay, we can change the names too

          Shalin Shekhar Mangar added a comment -

          I opened SOLR-5928 but I am busy with other issues right now so I won't get to it soon.

          ASF subversion and git services added a comment -

          Commit 1584739 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1584739 ]

          SOLR-5749: Renamed some stat names

          ASF subversion and git services added a comment -

          Commit 1584740 from shalin@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1584740 ]

          SOLR-5749: Renamed some stat names

          Uwe Schindler added a comment -

          Close issue after release of 4.8.0


            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0
              Watchers:
              9