Solr
  1. Solr
  2. SOLR-6554

Speed up overseer operations for collections with stateFormat > 1

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 5.0, 6.0
    • Fix Version/s: 5.0, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Right now (after SOLR-5473 was committed), a node watches a collection only if stateFormat=1 or if that node hosts at least one core belonging to that collection.

      This means that a node which is the overseer operates on all collections but watches only a few. So any read goes directly to zookeeper which slows down overseer operations.

      Let's have the overseer node watch all collections always and never remove those watches (except when the collection itself is deleted).

      1. SOLR-6554.patch
        143 kB
        Shalin Shekhar Mangar
      2. SOLR-6554.patch
        143 kB
        Shalin Shekhar Mangar
      3. SOLR-6554.patch
        142 kB
        Shalin Shekhar Mangar
      4. SOLR-6554.patch
        143 kB
        Shalin Shekhar Mangar
      5. SOLR-6554.patch
        145 kB
        Shalin Shekhar Mangar
      6. SOLR-6554.patch
        141 kB
        Shalin Shekhar Mangar
      7. SOLR-6554.patch
        144 kB
        Shalin Shekhar Mangar
      8. SOLR-6554.patch
        140 kB
        Shalin Shekhar Mangar
      9. SOLR-6554-batching-refactor.patch
        30 kB
        Shalin Shekhar Mangar
      10. SOLR-6554-batching-refactor.patch
        23 kB
        Shalin Shekhar Mangar
      11. SOLR-6554-batching-refactor.patch
        22 kB
        Shalin Shekhar Mangar
      12. SOLR-6554-batching-refactor.patch
        22 kB
        Shalin Shekhar Mangar
      13. SOLR-6554-workqueue-fixes.patch
        8 kB
        Shalin Shekhar Mangar
      14. SOLR-6554-workqueue-fixes.patch
        9 kB
        Shalin Shekhar Mangar
      15. SOLR-6554-workqueue-fixes.patch
        9 kB
        Shalin Shekhar Mangar

        Activity

        Hide
        Shalin Shekhar Mangar added a comment -

        I bit the bullet and refactored the overseer to be less of a mess than it is now.

        1. I grouped the cluster operations into cluster, collection, slice and replica and moved them to their own classes.
        2. A new class ZkStateWriter is introduced which uses ZkWriteCommand to update the clusterstate.
        3. Each overseer operation returns a ZkWriteCommand
        4. The force update of cluster state is no longer required inside the main overseer loop. It is read once at the start of the loop and then we use ZK compare-and-set to update the cluster states using the versions already read.
        5. The above also means that there is no need to watch every collection with stateFormat > 1 on the overseer node anymore.

        Todo

        1. There are some nocommits that need to be taken care of.
        2. Implement batching for collections with stateFormat > 1
        3. There are a few newer operations such as balanceSliceUnique which aren't implemented yet so some tests fail.
        4. More cleanup

        Here are some performance numbers using the new code vs branch_5x:

        Overseer queue size: 20000 state requests
        
        stateFormat = 1, With refactoring (trunk)
        =========================================
        250962 T13 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        250964 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 241639.501565
        250964 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0041383582263057345
        250965 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        250965 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        250965 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 241639.501565
        250966 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 241639.501565
        250966 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 241639.501565
        250966 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 241639.501565
        250966 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 241639.501565
        250967 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 241639.501565
        250967 T13 oasc.OverseerTest.testPerformance op: am_i_leader, success: 163, failure: 0
        250967 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 27.778517
        250967 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 40.51109030620299
        250967 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 60.52909864226439
        250968 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 67.32475977643367
        250968 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.17042034969325154
        250968 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.127852
        250968 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.159049
        250968 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.20707859999999995
        250968 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.9894586799999168
        250968 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 6.591979
        250968 T13 oasc.OverseerTest.testPerformance op: update_state, success: 161, failure: 0
        250969 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 105.56181
        250969 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 40.02790612569949
        250969 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 48.0
        250969 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 48.0
        250970 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.6556634161490683
        250970 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.57833
        250970 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.7091475
        250970 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.9294200000000001
        250970 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 3.997095759999999
        250970 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.075775
        250971 T13 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        250972 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 24677.266392
        250972 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 4971.795405166019
        250972 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 4190.29864341858
        250972 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 3199.6744276725544
        250972 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 1.2338016295185241
        250973 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 1.1432815
        250973 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 1.57099125
        250973 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.8449494
        250973 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.155266570000001
        250973 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.228459292000002
        
        
        stateFormat = 1, Without refactoring (branch_5x):
        ============================================================================================
        
        281984 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        281985 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 256532.804054
        281986 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0038981033718983866
        281986 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        281986 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        281987 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 256532.804054
        281987 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 256532.804054
        281987 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 256532.804054
        281988 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 256532.804054
        281988 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 256532.804054
        281988 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 256532.804054
        281989 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        281990 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 14883.542675
        281990 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 4679.183238623399
        281990 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 4042.064261836551
        281990 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 3041.931602459868
        281990 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.7441399267536623
        281991 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.6902204999999999
        281991 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.9253057499999999
        281991 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.0441081499999998
        281991 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.2398173200000007
        281991 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 1.4717788410000001
        281991 T11 oasc.OverseerTest.testPerformance op: update_state, success: 171, failure: 0
        281992 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 108.786403
        281992 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 40.00714188468187
        281992 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 48.0
        281992 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 48.0
        281992 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.6361777953216374
        281992 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.615385
        281992 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.731113
        281992 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.8640954000000003
        281993 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.1159375600000034
        281993 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.361583
        281993 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 172, failure: 0
        281993 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 23.880482
        281993 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 40.23061331599097
        281993 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 60.11834593132536
        281994 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 67.11122870391561
        281994 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.13884001162790696
        281994 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.14114549999999998
        281994 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.1714525
        281994 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.24251024999999998
        281994 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.2920699700000001
        281994 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 0.303612
        
        
        stateFormat = 2, 10 collections, With refactoring (trunk):
        ===========================================================
        359321 T13 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        359322 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 336222.016107
        359323 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0029742060387324683
        359323 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        359323 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        359324 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 336222.016107
        359324 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 336222.016107
        359324 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 336222.016107
        359325 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 336222.016107
        359325 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 336222.016107
        359325 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 336222.016107
        359325 T13 oasc.OverseerTest.testPerformance op: am_i_leader, success: 19898, failure: 0
        359326 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 2910.821076
        359326 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3551.0282475508393
        359326 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3007.215001818593
        359327 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1628.5124596704984
        359327 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.14628711810232184
        359327 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.127051
        359327 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.16182850000000001
        359327 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.22852089999999997
        359327 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.30072901000000013
        359327 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 1.135954577000003
        359327 T13 oasc.OverseerTest.testPerformance op: update_state, success: 19896, failure: 0
        359328 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 14968.90839
        359328 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3551.180599528965
        359329 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3006.8437316469813
        359329 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1628.3818327713007
        359329 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.7523576794330519
        359329 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.7057765
        359329 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.8617165
        359329 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.03365505
        359329 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.1740241200000008
        359329 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.014327310000006
        359330 T13 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        359330 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 25670.317603
        359330 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3572.411123496283
        359330 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3451.4741046935114
        359331 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 2559.18706643615
        359331 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 1.2834517075646217
        359331 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 1.2092995
        359331 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 1.59053175
        359331 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.8313089999999999
        359331 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.0310246800000007
        359331 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 2.4514601340000004
        
        
        stateFormat = 2, 10 collections, Without refactoring (branch_5x):
        =================================================================
        
        408300 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        408302 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 384185.906373
        408302 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0026028952612532213
        408302 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        408302 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        408303 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 384185.906373
        408303 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 384185.906373
        408303 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 384185.906373
        408303 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 384185.906373
        408303 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 384185.906373
        408304 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 384185.906373
        408304 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        408306 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 37886.743042
        408306 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3127.565044455818
        408306 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3017.721755102994
        408307 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 2191.9117215274555
        408307 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 1.8942424399780011
        408307 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 1.805034
        408307 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 2.33467975
        408307 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 2.72064335
        408308 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 3.380797200000005
        408308 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 6.764998426000004
        408308 T11 oasc.OverseerTest.testPerformance op: update_state, success: 20011, failure: 0
        408310 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 15664.348411
        408310 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3125.8333680772357
        408310 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3008.770952686416
        408311 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 2188.5152728989856
        408311 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.782786887761731
        408311 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.7298905
        408311 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.8860805
        408311 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.0464787999999998
        408312 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.2055892300000004
        408312 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.504650318000003
        408312 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 20013, failure: 0
        408313 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 3008.698855
        408313 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3125.585466114684
        408313 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3006.128012707961
        408313 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 2187.5621767051953
        408314 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.15033722355468945
        408314 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.1373615
        408314 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.17162775
        408314 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.23810549999999997
        408314 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.28888714000000004
        408314 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 0.36018958200000006
        
        
        stateFormat = 2, 100 collections, With refactoring (trunk):
        ===========================================================
        353683 T13 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        353685 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 344294.509037
        353686 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0029044719408401407
        353686 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        353686 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        353687 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 344294.509037
        353687 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 344294.509037
        353687 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 344294.509037
        353687 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 344294.509037
        353688 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 344294.509037
        353688 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 344294.509037
        353688 T13 oasc.OverseerTest.testPerformance op: am_i_leader, success: 19908, failure: 0
        353690 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 2899.55199
        353690 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3471.041899147638
        353690 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 2696.4544094117955
        353690 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1206.322204620017
        353690 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.1456475783604581
        353691 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.1307565
        353691 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.16647275
        353691 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.24111169999999996
        353691 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.29323824000000015
        353691 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 2.1718927580000065
        353691 T13 oasc.OverseerTest.testPerformance op: update_state, success: 19906, failure: 0
        353693 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 16055.477415
        353693 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3471.500723620727
        353693 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 2706.077181788675
        353693 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1216.7129526167616
        353693 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.8065647249572993
        353694 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.7490195
        353694 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.93685775
        353694 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.1227754
        353694 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.2775636000000001
        353694 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 5.825008328000003
        353694 T13 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        353695 T13 oasc.OverseerTest.printTimingStats 	 totalTime: 28055.800755
        353695 T13 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3509.7301625656414
        353696 T13 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3387.1127858679465
        353696 T13 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 2517.7854888232214
        353696 T13 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 1.4027199017549121
        353696 T13 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 1.283997
        353696 T13 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 1.751609
        353696 T13 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.98747225
        353697 T13 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.2288728700000013
        353697 T13 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 2.677476075
        
        
        stateFormat = 2, 100 collections, Without refactoring (branch_5x):
        ==================================================================
        642467 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        642468 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 592456.582081
        642471 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0016878842331991438
        642471 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        642471 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        642471 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 592456.582081
        642471 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 592456.582081
        642471 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 592456.582081
        642471 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 592456.582081
        642472 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 592456.582081
        642472 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 592456.582081
        642472 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        642473 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 38316.042986
        642473 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 2039.7101580674453
        642473 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 2225.3413952254627
        642473 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1666.2656764044962
        642474 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 1.9157063639818008
        642474 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 1.9428355
        642474 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 2.363384
        642474 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 2.6738654
        642474 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.9951339600000004
        642474 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.165273211000003
        642474 T11 oasc.OverseerTest.testPerformance op: update_state, success: 20101, failure: 0
        642475 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 16713.560223
        642476 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 2035.9152283337185
        642476 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 2256.420345106015
        642476 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1779.6120090712736
        642476 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.8314790419879609
        642476 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.800087
        642476 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.92331125
        642476 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.08387195
        642476 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.1850831800000001
        642476 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 1.7994323630000015
        642477 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 20103, failure: 0
        642478 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 3164.627444
        642478 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 2035.9212322908609
        642478 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 2258.1746414754753
        642478 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1785.9179009788268
        642478 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.15742065582251405
        642478 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.141227
        642478 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.17663375
        642479 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.23308624999999997
        642479 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.31163878
        642479 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 4.382885935000004
        

        For the worst case (100 collections with stateFormat=2), processing time went down from 592456ms to 344294ms.

        The state processing time has gone up in my patch which I'll fix.

        Show
        Shalin Shekhar Mangar added a comment - I bit the bullet and refactored the overseer to be less of a mess than it is now. I grouped the cluster operations into cluster, collection, slice and replica and moved them to their own classes. A new class ZkStateWriter is introduced which uses ZkWriteCommand to update the clusterstate. Each overseer operation returns a ZkWriteCommand The force update of cluster state is no longer required inside the main overseer loop. It is read once at the start of the loop and then we use ZK compare-and-set to update the cluster states using the versions already read. The above also means that there is no need to watch every collection with stateFormat > 1 on the overseer node anymore. Todo There are some nocommits that need to be taken care of. Implement batching for collections with stateFormat > 1 There are a few newer operations such as balanceSliceUnique which aren't implemented yet so some tests fail. More cleanup Here are some performance numbers using the new code vs branch_5x: Overseer queue size: 20000 state requests stateFormat = 1, With refactoring (trunk) ========================================= 250962 T13 oasc.OverseerTest.testPerformance Overseer loop finished processing: 250964 T13 oasc.OverseerTest.printTimingStats totalTime: 241639.501565 250964 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0041383582263057345 250965 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 250965 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 250965 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 241639.501565 250966 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 241639.501565 250966 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 241639.501565 250966 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 241639.501565 250966 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 241639.501565 250967 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 241639.501565 250967 T13 oasc.OverseerTest.testPerformance op: am_i_leader, success: 163, failure: 0 250967 T13 oasc.OverseerTest.printTimingStats totalTime: 27.778517 250967 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 40.51109030620299 250967 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 60.52909864226439 250968 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 67.32475977643367 250968 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.17042034969325154 250968 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 0.127852 250968 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.159049 250968 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.20707859999999995 250968 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.9894586799999168 250968 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 6.591979 250968 T13 oasc.OverseerTest.testPerformance op: update_state, success: 161, failure: 0 250969 T13 oasc.OverseerTest.printTimingStats totalTime: 105.56181 250969 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 40.02790612569949 250969 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 48.0 250969 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 48.0 250970 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.6556634161490683 250970 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 0.57833 250970 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.7091475 250970 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.9294200000000001 250970 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 3.997095759999999 250970 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.075775 250971 T13 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 250972 T13 oasc.OverseerTest.printTimingStats totalTime: 24677.266392 250972 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 4971.795405166019 250972 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 4190.29864341858 250972 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 3199.6744276725544 250972 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 1.2338016295185241 250973 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 1.1432815 250973 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 1.57099125 250973 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.8449494 250973 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.155266570000001 250973 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.228459292000002 stateFormat = 1, Without refactoring (branch_5x): ============================================================================================ 281984 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 281985 T11 oasc.OverseerTest.printTimingStats totalTime: 256532.804054 281986 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0038981033718983866 281986 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 281986 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 281987 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 256532.804054 281987 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 256532.804054 281987 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 256532.804054 281988 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 256532.804054 281988 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 256532.804054 281988 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 256532.804054 281989 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 281990 T11 oasc.OverseerTest.printTimingStats totalTime: 14883.542675 281990 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 4679.183238623399 281990 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 4042.064261836551 281990 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 3041.931602459868 281990 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.7441399267536623 281991 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.6902204999999999 281991 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.9253057499999999 281991 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.0441081499999998 281991 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.2398173200000007 281991 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 1.4717788410000001 281991 T11 oasc.OverseerTest.testPerformance op: update_state, success: 171, failure: 0 281992 T11 oasc.OverseerTest.printTimingStats totalTime: 108.786403 281992 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 40.00714188468187 281992 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 48.0 281992 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 48.0 281992 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.6361777953216374 281992 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.615385 281992 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.731113 281992 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.8640954000000003 281993 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.1159375600000034 281993 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.361583 281993 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 172, failure: 0 281993 T11 oasc.OverseerTest.printTimingStats totalTime: 23.880482 281993 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 40.23061331599097 281993 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 60.11834593132536 281994 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 67.11122870391561 281994 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.13884001162790696 281994 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.14114549999999998 281994 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.1714525 281994 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.24251024999999998 281994 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.2920699700000001 281994 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 0.303612 stateFormat = 2, 10 collections, With refactoring (trunk): =========================================================== 359321 T13 oasc.OverseerTest.testPerformance Overseer loop finished processing: 359322 T13 oasc.OverseerTest.printTimingStats totalTime: 336222.016107 359323 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0029742060387324683 359323 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 359323 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 359324 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 336222.016107 359324 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 336222.016107 359324 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 336222.016107 359325 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 336222.016107 359325 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 336222.016107 359325 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 336222.016107 359325 T13 oasc.OverseerTest.testPerformance op: am_i_leader, success: 19898, failure: 0 359326 T13 oasc.OverseerTest.printTimingStats totalTime: 2910.821076 359326 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3551.0282475508393 359326 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3007.215001818593 359327 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1628.5124596704984 359327 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.14628711810232184 359327 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 0.127051 359327 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.16182850000000001 359327 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.22852089999999997 359327 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.30072901000000013 359327 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 1.135954577000003 359327 T13 oasc.OverseerTest.testPerformance op: update_state, success: 19896, failure: 0 359328 T13 oasc.OverseerTest.printTimingStats totalTime: 14968.90839 359328 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3551.180599528965 359329 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3006.8437316469813 359329 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1628.3818327713007 359329 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.7523576794330519 359329 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 0.7057765 359329 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.8617165 359329 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.03365505 359329 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.1740241200000008 359329 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.014327310000006 359330 T13 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 359330 T13 oasc.OverseerTest.printTimingStats totalTime: 25670.317603 359330 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3572.411123496283 359330 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3451.4741046935114 359331 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 2559.18706643615 359331 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 1.2834517075646217 359331 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 1.2092995 359331 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 1.59053175 359331 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.8313089999999999 359331 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.0310246800000007 359331 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 2.4514601340000004 stateFormat = 2, 10 collections, Without refactoring (branch_5x): ================================================================= 408300 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 408302 T11 oasc.OverseerTest.printTimingStats totalTime: 384185.906373 408302 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0026028952612532213 408302 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 408302 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 408303 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 384185.906373 408303 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 384185.906373 408303 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 384185.906373 408303 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 384185.906373 408303 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 384185.906373 408304 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 384185.906373 408304 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 408306 T11 oasc.OverseerTest.printTimingStats totalTime: 37886.743042 408306 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3127.565044455818 408306 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3017.721755102994 408307 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 2191.9117215274555 408307 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 1.8942424399780011 408307 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 1.805034 408307 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 2.33467975 408307 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 2.72064335 408308 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 3.380797200000005 408308 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 6.764998426000004 408308 T11 oasc.OverseerTest.testPerformance op: update_state, success: 20011, failure: 0 408310 T11 oasc.OverseerTest.printTimingStats totalTime: 15664.348411 408310 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3125.8333680772357 408310 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3008.770952686416 408311 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 2188.5152728989856 408311 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.782786887761731 408311 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.7298905 408311 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.8860805 408311 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.0464787999999998 408312 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.2055892300000004 408312 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.504650318000003 408312 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 20013, failure: 0 408313 T11 oasc.OverseerTest.printTimingStats totalTime: 3008.698855 408313 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3125.585466114684 408313 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3006.128012707961 408313 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 2187.5621767051953 408314 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.15033722355468945 408314 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.1373615 408314 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.17162775 408314 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.23810549999999997 408314 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.28888714000000004 408314 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 0.36018958200000006 stateFormat = 2, 100 collections, With refactoring (trunk): =========================================================== 353683 T13 oasc.OverseerTest.testPerformance Overseer loop finished processing: 353685 T13 oasc.OverseerTest.printTimingStats totalTime: 344294.509037 353686 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0029044719408401407 353686 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 353686 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 353687 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 344294.509037 353687 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 344294.509037 353687 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 344294.509037 353687 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 344294.509037 353688 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 344294.509037 353688 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 344294.509037 353688 T13 oasc.OverseerTest.testPerformance op: am_i_leader, success: 19908, failure: 0 353690 T13 oasc.OverseerTest.printTimingStats totalTime: 2899.55199 353690 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3471.041899147638 353690 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 2696.4544094117955 353690 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1206.322204620017 353690 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.1456475783604581 353691 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 0.1307565 353691 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.16647275 353691 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.24111169999999996 353691 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.29323824000000015 353691 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 2.1718927580000065 353691 T13 oasc.OverseerTest.testPerformance op: update_state, success: 19906, failure: 0 353693 T13 oasc.OverseerTest.printTimingStats totalTime: 16055.477415 353693 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3471.500723620727 353693 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 2706.077181788675 353693 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1216.7129526167616 353693 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.8065647249572993 353694 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 0.7490195 353694 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.93685775 353694 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.1227754 353694 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.2775636000000001 353694 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 5.825008328000003 353694 T13 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 353695 T13 oasc.OverseerTest.printTimingStats totalTime: 28055.800755 353695 T13 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3509.7301625656414 353696 T13 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3387.1127858679465 353696 T13 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 2517.7854888232214 353696 T13 oasc.OverseerTest.printTimingStats avgTimePerRequest: 1.4027199017549121 353696 T13 oasc.OverseerTest.printTimingStats medianRequestTime: 1.283997 353696 T13 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 1.751609 353696 T13 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.98747225 353697 T13 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.2288728700000013 353697 T13 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 2.677476075 stateFormat = 2, 100 collections, Without refactoring (branch_5x): ================================================================== 642467 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 642468 T11 oasc.OverseerTest.printTimingStats totalTime: 592456.582081 642471 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0016878842331991438 642471 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 642471 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 642471 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 592456.582081 642471 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 592456.582081 642471 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 592456.582081 642471 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 592456.582081 642472 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 592456.582081 642472 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 592456.582081 642472 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 642473 T11 oasc.OverseerTest.printTimingStats totalTime: 38316.042986 642473 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 2039.7101580674453 642473 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 2225.3413952254627 642473 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1666.2656764044962 642474 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 1.9157063639818008 642474 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 1.9428355 642474 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 2.363384 642474 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 2.6738654 642474 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.9951339600000004 642474 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.165273211000003 642474 T11 oasc.OverseerTest.testPerformance op: update_state, success: 20101, failure: 0 642475 T11 oasc.OverseerTest.printTimingStats totalTime: 16713.560223 642476 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 2035.9152283337185 642476 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 2256.420345106015 642476 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1779.6120090712736 642476 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.8314790419879609 642476 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.800087 642476 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.92331125 642476 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.08387195 642476 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.1850831800000001 642476 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 1.7994323630000015 642477 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 20103, failure: 0 642478 T11 oasc.OverseerTest.printTimingStats totalTime: 3164.627444 642478 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 2035.9212322908609 642478 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 2258.1746414754753 642478 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1785.9179009788268 642478 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.15742065582251405 642478 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.141227 642478 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.17663375 642479 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.23308624999999997 642479 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.31163878 642479 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 4.382885935000004 For the worst case (100 collections with stateFormat=2), processing time went down from 592456ms to 344294ms. The state processing time has gone up in my patch which I'll fix.
        Hide
        Shalin Shekhar Mangar added a comment -

        Here's a better patch which takes care of some of the nocommits:

        1. The live nodes information is read from the ZkStateReader's cluster state because the one in ZkStateWriter's cluster state might be stale
        2. We force refresh the clusterstate once at the beginning of the loop and then only if there's an error in the main loop.
        3. Added a simple TestClusterStateMutator, I'll add more.

        I still have a few ideas I'd like to try.

        Show
        Shalin Shekhar Mangar added a comment - Here's a better patch which takes care of some of the nocommits: The live nodes information is read from the ZkStateReader's cluster state because the one in ZkStateWriter's cluster state might be stale We force refresh the clusterstate once at the beginning of the loop and then only if there's an error in the main loop. Added a simple TestClusterStateMutator, I'll add more. I still have a few ideas I'd like to try.
        Hide
        Shalin Shekhar Mangar added a comment -

        Patch updated to trunk.

        Show
        Shalin Shekhar Mangar added a comment - Patch updated to trunk.
        Hide
        Shalin Shekhar Mangar added a comment -
        1. Removed most no-commits (except for the naive performance test in OverseerTest)
        2. Processes the work queue first in case of an error while writing to ZK (most likely a bad version)
        3. The BALANCESLICEUNIQUE tests pass (I just made it work but I didn't refactor it into a mutator-like class)

        I think we're very close. This might be a good time for people to review this patch.

        I am inclined to remove the separate work queue loop because work queue items are executed in the state update loop anyways. It will just need careful review of the synchronization.

        Show
        Shalin Shekhar Mangar added a comment - Removed most no-commits (except for the naive performance test in OverseerTest) Processes the work queue first in case of an error while writing to ZK (most likely a bad version) The BALANCESLICEUNIQUE tests pass (I just made it work but I didn't refactor it into a mutator-like class) I think we're very close. This might be a good time for people to review this patch. I am inclined to remove the separate work queue loop because work queue items are executed in the state update loop anyways. It will just need careful review of the synchronization.
        Hide
        Shalin Shekhar Mangar added a comment -

        Updated patch to trunk.

        There's a review request open if someone is interested. See https://reviews.apache.org/r/27872/

        Show
        Shalin Shekhar Mangar added a comment - Updated patch to trunk. There's a review request open if someone is interested. See https://reviews.apache.org/r/27872/
        Hide
        Shalin Shekhar Mangar added a comment -

        There's a test failure with the new replica properties feature which I had fixed earlier. Probably something got reverted during the merge. I'll take a look tomorrow.

        Show
        Shalin Shekhar Mangar added a comment - There's a test failure with the new replica properties feature which I had fixed earlier. Probably something got reverted during the merge. I'll take a look tomorrow.
        Hide
        Noble Paul added a comment -

        This is a much needed refatoring for Overseer. good jo. I have not gotten around to do a full review but I have seen enough of it and looks good. do we have batching of operations in stateFormat=2 now?

        Show
        Noble Paul added a comment - This is a much needed refatoring for Overseer. good jo. I have not gotten around to do a full review but I have seen enough of it and looks good. do we have batching of operations in stateFormat=2 now?
        Hide
        Shalin Shekhar Mangar added a comment -

        Here's a patch which passes all tests.

        Show
        Shalin Shekhar Mangar added a comment - Here's a patch which passes all tests.
        Hide
        Shalin Shekhar Mangar added a comment -

        Thanks for the review Noble. No, it doesn't batch stateFormat=2 right now but that's easy to add. I'll add it in the next patch.

        Show
        Shalin Shekhar Mangar added a comment - Thanks for the review Noble. No, it doesn't batch stateFormat=2 right now but that's easy to add. I'll add it in the next patch.
        Hide
        Shalin Shekhar Mangar added a comment -

        Here's a patch which adds batching of operations for stateFormat=2 as along as the consecutive messages operate on the same collection. It is not possible to preserve atomicity of writes for different collections because they are written to multiple zk nodes.

        I think this is ready. I'll commit this patch to trunk as a checkpoint. But I want to refactor a bit more such that the batching can be encapsulated inside ZkStateWriter further simplifying the overseer loop.

        Show
        Shalin Shekhar Mangar added a comment - Here's a patch which adds batching of operations for stateFormat=2 as along as the consecutive messages operate on the same collection. It is not possible to preserve atomicity of writes for different collections because they are written to multiple zk nodes. I think this is ready. I'll commit this patch to trunk as a checkpoint. But I want to refactor a bit more such that the batching can be encapsulated inside ZkStateWriter further simplifying the overseer loop.
        Hide
        Shalin Shekhar Mangar added a comment -

        The last patch had some extra changes to ShardSplitTest which I have removed. I have kept the overseer performance test inside OverseerTest but marked it as ignored. I'll commit this shortly.

        Show
        Shalin Shekhar Mangar added a comment - The last patch had some extra changes to ShardSplitTest which I have removed. I have kept the overseer performance test inside OverseerTest but marked it as ignored. I'll commit this shortly.
        Hide
        Mark Miller added a comment -

        Im not following this closely, so perhaps this is no help in this case, but there is a zk multi api call that lets you atomically write multiple zk nodes I believe.

        Show
        Mark Miller added a comment - Im not following this closely, so perhaps this is no help in this case, but there is a zk multi api call that lets you atomically write multiple zk nodes I believe.
        Hide
        Shalin Shekhar Mangar added a comment -

        Im not following this closely, so perhaps this is no help in this case, but there is a zk multi api call that lets you atomically write multiple zk nodes I believe.

        That is helpful. I didn't know that, thanks! If that is true then we can implement true batching for stateFormat=2 collections like what we have for the main cluster state. I'll commit this anyway and work on further changes as a second patch.

        Show
        Shalin Shekhar Mangar added a comment - Im not following this closely, so perhaps this is no help in this case, but there is a zk multi api call that lets you atomically write multiple zk nodes I believe. That is helpful. I didn't know that, thanks! If that is true then we can implement true batching for stateFormat=2 collections like what we have for the main cluster state. I'll commit this anyway and work on further changes as a second patch.
        Hide
        ASF subversion and git services added a comment -

        Commit 1642437 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1642437 ]

        SOLR-6554: Speed up overseer operations

        Show
        ASF subversion and git services added a comment - Commit 1642437 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1642437 ] SOLR-6554 : Speed up overseer operations
        Hide
        ASF subversion and git services added a comment -

        Commit 1642443 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1642443 ]

        SOLR-6554: Removed one-liner methods and unused methods from overseer

        Show
        ASF subversion and git services added a comment - Commit 1642443 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1642443 ] SOLR-6554 : Removed one-liner methods and unused methods from overseer
        Hide
        Shalin Shekhar Mangar added a comment -

        I looked into the multi method in zk but it's limitation is that the combined payload of all the operations in a multi call must be less than 1MB (jute.maxbuffer limit) so it can be a bit tricky to implement.

        Show
        Shalin Shekhar Mangar added a comment - I looked into the multi method in zk but it's limitation is that the combined payload of all the operations in a multi call must be less than 1MB (jute.maxbuffer limit) so it can be a bit tricky to implement.
        Hide
        ASF subversion and git services added a comment -

        Commit 1642466 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1642466 ]

        SOLR-6554: Add a ClusterState.copyWith method which accepts a collection directly instead of a map of collections and use it throughout

        Show
        ASF subversion and git services added a comment - Commit 1642466 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1642466 ] SOLR-6554 : Add a ClusterState.copyWith method which accepts a collection directly instead of a map of collections and use it throughout
        Hide
        Shalin Shekhar Mangar added a comment -

        This patch further simplifies the overseer loop and encapsulates batching related code inside ZkStateWriter and refactors common code for message execution into a separate method. The updateState methods are no longer necessary inside Overseer. All tests pass.

        Show
        Shalin Shekhar Mangar added a comment - This patch further simplifies the overseer loop and encapsulates batching related code inside ZkStateWriter and refactors common code for message execution into a separate method. The updateState methods are no longer necessary inside Overseer. All tests pass.
        Hide
        Noble Paul added a comment -

        IMHO multi write is not really useful . The reason for batching is to minimize the number of listeners fired and the number of clusterstate updates in the listener nodes. So, whether you write those nodes in multiple ops or in a single op does not really matter

        Show
        Noble Paul added a comment - IMHO multi write is not really useful . The reason for batching is to minimize the number of listeners fired and the number of clusterstate updates in the listener nodes. So, whether you write those nodes in multiple ops or in a single op does not really matter
        Hide
        Shalin Shekhar Mangar added a comment -

        There was a bug in my last refactoring. We must have two different batching checks; one which is checked before the current message is processed and the second which is checked after. Otherwise there is no way to keep messages from different state formats from being batched together.

        Show
        Shalin Shekhar Mangar added a comment - There was a bug in my last refactoring. We must have two different batching checks; one which is checked before the current message is processed and the second which is checked after. Otherwise there is no way to keep messages from different state formats from being batched together.
        Hide
        Shalin Shekhar Mangar added a comment -

        IMHO multi write is not really useful . The reason for batching is to minimize the number of listeners fired and the number of clusterstate updates in the listener nodes. So, whether you write those nodes in multiple ops or in a single op does not really matter

        I think you may be right but it is worth testing out. The multi write will reduce writes to ZK and maybe reduce the number of watcher fires under the right conditions (interleaved messages for two different collections). I'll do a quick test to see if it helps.

        Show
        Shalin Shekhar Mangar added a comment - IMHO multi write is not really useful . The reason for batching is to minimize the number of listeners fired and the number of clusterstate updates in the listener nodes. So, whether you write those nodes in multiple ops or in a single op does not really matter I think you may be right but it is worth testing out. The multi write will reduce writes to ZK and maybe reduce the number of watcher fires under the right conditions (interleaved messages for two different collections). I'll do a quick test to see if it helps.
        Hide
        Shalin Shekhar Mangar added a comment -

        Fixed another bug with batching code where zk writes before processing items would be lost if the next message creates a new collection. All tests pass.

        Show
        Shalin Shekhar Mangar added a comment - Fixed another bug with batching code where zk writes before processing items would be lost if the next message creates a new collection. All tests pass.
        Hide
        Shalin Shekhar Mangar added a comment -

        I've added a test for the batching logic inside ZkStateWriter. This is ready.

        Show
        Shalin Shekhar Mangar added a comment - I've added a test for the batching logic inside ZkStateWriter. This is ready.
        Hide
        Ramkumar Aiyengar added a comment -

        Would be nice to see the updated perf numbers to confirm if the state change regression for stateFormat = 1 has gone away..

        Show
        Ramkumar Aiyengar added a comment - Would be nice to see the updated perf numbers to confirm if the state change regression for stateFormat = 1 has gone away..
        Hide
        ASF subversion and git services added a comment -

        Commit 1642693 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1642693 ]

        SOLR-6554: Refactored Overseer to improve reuse and put batching logic inside ZkStateWriter. Added a new ZkStateWriterTest.

        Show
        ASF subversion and git services added a comment - Commit 1642693 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1642693 ] SOLR-6554 : Refactored Overseer to improve reuse and put batching logic inside ZkStateWriter. Added a new ZkStateWriterTest.
        Hide
        Shalin Shekhar Mangar added a comment -

        bq, Would be nice to see the updated perf numbers to confirm if the state change regression for stateFormat = 1 has gone away..

        No, I suspect the slow down still exists but the overseer itself is faster by ~5% inspite of it. I expect the real performance to be even better when ZK runs on a different box. I still need to dig in to the slowdown though before I close this issue.

        Show
        Shalin Shekhar Mangar added a comment - bq, Would be nice to see the updated perf numbers to confirm if the state change regression for stateFormat = 1 has gone away.. No, I suspect the slow down still exists but the overseer itself is faster by ~5% inspite of it. I expect the real performance to be even better when ZK runs on a different box. I still need to dig in to the slowdown though before I close this issue.
        Hide
        Shalin Shekhar Mangar added a comment -

        Actually, the improvements in Overseer for stateFormat=1 (the default case) is much better than I expected. After the refactorings, the amILeader calls are very infrequent and the speed up is about 40%:

        Overseer queue size: 20000 state requests
        
        stateFormat = 1, With refactoring (trunk)
        =========================================
        
        216071 T12 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        216072 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 201411.465265
        216072 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.004964922311489345
        216073 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        216073 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        216073 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 201411.465265
        216073 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 201411.465265
        216073 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 201411.465265
        216074 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 201411.465265
        216074 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 201411.465265
        216074 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 201411.465265
        216075 T12 oasc.OverseerTest.testPerformance op: am_i_leader, success: 2, failure: 0
        216075 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 9.377281
        216075 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.5969575423185497
        216075 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 12.529098642264385
        216075 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 19.324759776433687
        216075 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 4.6886405
        216076 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 4.6886405
        216076 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 9.022041
        216076 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 9.022041
        216076 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 9.022041
        216076 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 9.022041
        216077 T12 oasc.OverseerTest.testPerformance op: update_state, success: 135, failure: 0
        216077 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 61.333751
        216077 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 40.31065112174398
        216077 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 48.0
        216078 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 48.0
        216078 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.4543240814814815
        216078 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.364217
        216078 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.409896
        216078 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.9332719999999994
        216079 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 3.576287319999995
        216079 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 3.700744
        216079 T12 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        216081 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 13344.072646
        216081 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 5973.226142698651
        216081 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 4437.949777291698
        216082 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 3247.958438006491
        216082 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.6671702737863107
        216083 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.6112960000000001
        216083 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.65861125
        216083 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.9373918
        216083 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.179823900000002
        216083 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 6.713780613000015
        
        
        stateFormat = 1, Without refactoring (branch_5x):
        ============================================================================================
        
        354435 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        354437 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 336777.887
        354438 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.0029692955509913457
        354438 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        354438 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        354439 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 336777.887
        354439 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 336777.887
        354439 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 336777.887
        354440 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 336777.887
        354440 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 336777.887
        354440 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 336777.887
        354441 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        354444 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 13029.408
        354444 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 3570.0750281584515
        354444 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 3169.209724490217
        354445 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 2124.6849108211077
        354445 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.6514378281085945
        354445 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.59
        354446 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.633
        354446 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.8480999999999999
        354446 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.9995200000000004
        354447 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 1.736079000000002
        354447 T11 oasc.OverseerTest.testPerformance op: update_state, success: 222, failure: 0
        354448 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 98.244
        354448 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 39.622607985461286
        354448 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 48.0
        354448 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 48.0
        354449 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.44254054054054054
        354449 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.3835
        354450 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.463
        354450 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.7994499999999999
        354450 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.2152900000000026
        354451 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 2.452
        354451 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 223, failure: 0
        354452 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 43.33
        354453 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 39.777330428482294
        354453 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 57.7576718337744
        354453 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 65.77963729636123
        354453 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.194304932735426
        354454 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.149
        354454 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.188
        354454 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.25839999999999996
        354454 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.47591999999999895
        354455 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 5.712
        

        Do not compare these numbers with the last ones because this test was run on a different box. Also trunk used jdk1.8.0_25 and branch_5x was run on jdk1.7.0_25. I'm running the other tests and I will report back shortly.

        Show
        Shalin Shekhar Mangar added a comment - Actually, the improvements in Overseer for stateFormat=1 (the default case) is much better than I expected. After the refactorings, the amILeader calls are very infrequent and the speed up is about 40%: Overseer queue size: 20000 state requests stateFormat = 1, With refactoring (trunk) ========================================= 216071 T12 oasc.OverseerTest.testPerformance Overseer loop finished processing: 216072 T12 oasc.OverseerTest.printTimingStats totalTime: 201411.465265 216072 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.004964922311489345 216073 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 216073 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 216073 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 201411.465265 216073 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 201411.465265 216073 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 201411.465265 216074 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 201411.465265 216074 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 201411.465265 216074 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 201411.465265 216075 T12 oasc.OverseerTest.testPerformance op: am_i_leader, success: 2, failure: 0 216075 T12 oasc.OverseerTest.printTimingStats totalTime: 9.377281 216075 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.5969575423185497 216075 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 12.529098642264385 216075 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 19.324759776433687 216075 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 4.6886405 216076 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 4.6886405 216076 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 9.022041 216076 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 9.022041 216076 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 9.022041 216076 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 9.022041 216077 T12 oasc.OverseerTest.testPerformance op: update_state, success: 135, failure: 0 216077 T12 oasc.OverseerTest.printTimingStats totalTime: 61.333751 216077 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 40.31065112174398 216077 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 48.0 216078 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 48.0 216078 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.4543240814814815 216078 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 0.364217 216078 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.409896 216078 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.9332719999999994 216079 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 3.576287319999995 216079 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 3.700744 216079 T12 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 216081 T12 oasc.OverseerTest.printTimingStats totalTime: 13344.072646 216081 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 5973.226142698651 216081 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 4437.949777291698 216082 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 3247.958438006491 216082 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.6671702737863107 216083 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 0.6112960000000001 216083 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.65861125 216083 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.9373918 216083 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.179823900000002 216083 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 6.713780613000015 stateFormat = 1, Without refactoring (branch_5x): ============================================================================================ 354435 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 354437 T11 oasc.OverseerTest.printTimingStats totalTime: 336777.887 354438 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.0029692955509913457 354438 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 354438 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 354439 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 336777.887 354439 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 336777.887 354439 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 336777.887 354440 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 336777.887 354440 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 336777.887 354440 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 336777.887 354441 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 354444 T11 oasc.OverseerTest.printTimingStats totalTime: 13029.408 354444 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 3570.0750281584515 354444 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 3169.209724490217 354445 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 2124.6849108211077 354445 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.6514378281085945 354445 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.59 354446 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.633 354446 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.8480999999999999 354446 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.9995200000000004 354447 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 1.736079000000002 354447 T11 oasc.OverseerTest.testPerformance op: update_state, success: 222, failure: 0 354448 T11 oasc.OverseerTest.printTimingStats totalTime: 98.244 354448 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 39.622607985461286 354448 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 48.0 354448 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 48.0 354449 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.44254054054054054 354449 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.3835 354450 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.463 354450 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.7994499999999999 354450 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.2152900000000026 354451 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 2.452 354451 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 223, failure: 0 354452 T11 oasc.OverseerTest.printTimingStats totalTime: 43.33 354453 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 39.777330428482294 354453 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 57.7576718337744 354453 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 65.77963729636123 354453 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.194304932735426 354454 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.149 354454 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.188 354454 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.25839999999999996 354454 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.47591999999999895 354455 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 5.712 Do not compare these numbers with the last ones because this test was run on a different box. Also trunk used jdk1.8.0_25 and branch_5x was run on jdk1.7.0_25. I'm running the other tests and I will report back shortly.
        Hide
        ASF subversion and git services added a comment -

        Commit 1642708 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1642708 ]

        SOLR-6554: Moved performance test factors into constants

        Show
        ASF subversion and git services added a comment - Commit 1642708 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1642708 ] SOLR-6554 : Moved performance test factors into constants
        Hide
        Shalin Shekhar Mangar added a comment -

        Here is one more test with 10 collections all having stateFormat=2

        stateFormat = 2, 10 collections, With refactoring (trunk):
        ===========================================================
        
        244536 T12 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        244541 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 230151.969647
        244541 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.004344859792294027
        244542 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        244542 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        244542 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 230151.969647
        244543 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 230151.969647
        244543 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 230151.969647
        244543 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 230151.969647
        244543 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 230151.969647
        244544 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 230151.969647
        244544 T12 oasc.OverseerTest.testPerformance op: am_i_leader, success: 2, failure: 0
        244544 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 9.364505
        244544 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 0.5225137226108014
        244545 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 11.527327226155185
        244545 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 18.795347877757326
        244545 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 4.6822525
        244545 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 4.6822525
        244545 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 8.98267
        244546 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 8.98267
        244546 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 8.98267
        244546 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 8.98267
        244546 T12 oasc.OverseerTest.testPerformance op: update_state, success: 20011, failure: 0
        244548 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 12520.257356
        244549 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 5230.413197388874
        244549 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 4052.517943905357
        244550 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 3016.875734960303
        244550 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.6256687499875069
        244550 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.5419605000000001
        244551 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.6782735
        244551 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.07258755
        244551 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1.7485102400000028
        244551 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 8.31249523400001
        244552 T12 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        244554 T12 oasc.OverseerTest.printTimingStats 	 totalTime: 14652.769701
        244554 T12 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 5237.58427279227
        244554 T12 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 4130.052867937011
        244555 T12 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 3125.3532353495934
        244555 T12 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.7326018549572522
        244555 T12 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.6325495
        244555 T12 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.771651
        244556 T12 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.2006243999999988
        244556 T12 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.0507515900000057
        244556 T12 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 8.561998272000013
        
        
        stateFormat = 2, 10 collections, Without refactoring (branch_5x):
        =================================================================
        
        1329102 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 
        1329105 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 1312114.128
        1329106 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 7.621263475999054E-4
        1329107 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 0.0
        1329107 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 0.0
        1329107 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 1312114.128
        1329107 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 1312114.128
        1329108 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 1312114.128
        1329108 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1312114.128
        1329108 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 1312114.128
        1329108 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 1312114.128
        1329108 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0
        1329110 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 466104.173
        1329110 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 915.6800211767887
        1329111 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 1418.8474273248714
        1329111 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1043.4826929249386
        1329111 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 23.304043447827606
        1329111 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 2.3715
        1329111 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 2.7197500000000003
        1329111 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 3.977099999999994
        1329112 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 7.936150000000027
        1329112 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 35.16617800000001
        1329112 T11 oasc.OverseerTest.testPerformance op: update_state, success: 20011, failure: 0
        1329114 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 327388.98
        1329114 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 915.6585782350854
        1329114 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 1425.935654658605
        1329115 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1049.1858608633495
        1329115 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 16.360450752086354
        1329115 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.725
        1329116 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.844
        1329116 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 1.218
        1329117 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 2.1403600000000043
        1329117 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 12.222141000000011
        1329117 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 20013, failure: 0
        1329119 T11 oasc.OverseerTest.printTimingStats 	 totalTime: 3622.9
        1329120 T11 oasc.OverseerTest.printTimingStats 	 avgRequestsPerMinute: 915.5821178966359
        1329120 T11 oasc.OverseerTest.printTimingStats 	 5minRateRequestsPerMinute: 1424.927585830734
        1329120 T11 oasc.OverseerTest.printTimingStats 	 15minRateRequestsPerMinute: 1042.4347274913655
        1329121 T11 oasc.OverseerTest.printTimingStats 	 avgTimePerRequest: 0.18102733223404788
        1329121 T11 oasc.OverseerTest.printTimingStats 	 medianRequestTime: 0.179
        1329121 T11 oasc.OverseerTest.printTimingStats 	 75thPctlRequestTime: 0.197
        1329122 T11 oasc.OverseerTest.printTimingStats 	 95thPctlRequestTime: 0.25909999999999994
        1329122 T11 oasc.OverseerTest.printTimingStats 	 99thPctlRequestTime: 0.5761600000000034
        1329123 T11 oasc.OverseerTest.printTimingStats 	 999thPctlRequestTime: 17.419063000000026
        

        I'm not going to run more tests simply because they take too much time

        Show
        Shalin Shekhar Mangar added a comment - Here is one more test with 10 collections all having stateFormat=2 stateFormat = 2, 10 collections, With refactoring (trunk): =========================================================== 244536 T12 oasc.OverseerTest.testPerformance Overseer loop finished processing: 244541 T12 oasc.OverseerTest.printTimingStats totalTime: 230151.969647 244541 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.004344859792294027 244542 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 244542 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 244542 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 230151.969647 244543 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 230151.969647 244543 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 230151.969647 244543 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 230151.969647 244543 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 230151.969647 244544 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 230151.969647 244544 T12 oasc.OverseerTest.testPerformance op: am_i_leader, success: 2, failure: 0 244544 T12 oasc.OverseerTest.printTimingStats totalTime: 9.364505 244544 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 0.5225137226108014 244545 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 11.527327226155185 244545 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 18.795347877757326 244545 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 4.6822525 244545 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 4.6822525 244545 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 8.98267 244546 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 8.98267 244546 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 8.98267 244546 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 8.98267 244546 T12 oasc.OverseerTest.testPerformance op: update_state, success: 20011, failure: 0 244548 T12 oasc.OverseerTest.printTimingStats totalTime: 12520.257356 244549 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 5230.413197388874 244549 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 4052.517943905357 244550 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 3016.875734960303 244550 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.6256687499875069 244550 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 0.5419605000000001 244551 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.6782735 244551 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.07258755 244551 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1.7485102400000028 244551 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 8.31249523400001 244552 T12 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 244554 T12 oasc.OverseerTest.printTimingStats totalTime: 14652.769701 244554 T12 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 5237.58427279227 244554 T12 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 4130.052867937011 244555 T12 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 3125.3532353495934 244555 T12 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.7326018549572522 244555 T12 oasc.OverseerTest.printTimingStats medianRequestTime: 0.6325495 244555 T12 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.771651 244556 T12 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.2006243999999988 244556 T12 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.0507515900000057 244556 T12 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 8.561998272000013 stateFormat = 2, 10 collections, Without refactoring (branch_5x): ================================================================= 1329102 T11 oasc.OverseerTest.testPerformance Overseer loop finished processing: 1329105 T11 oasc.OverseerTest.printTimingStats totalTime: 1312114.128 1329106 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 7.621263475999054E-4 1329107 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 0.0 1329107 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 0.0 1329107 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 1312114.128 1329107 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 1312114.128 1329108 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 1312114.128 1329108 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1312114.128 1329108 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 1312114.128 1329108 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 1312114.128 1329108 T11 oasc.OverseerTest.testPerformance op: state, success: 20001, failure: 0 1329110 T11 oasc.OverseerTest.printTimingStats totalTime: 466104.173 1329110 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 915.6800211767887 1329111 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 1418.8474273248714 1329111 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1043.4826929249386 1329111 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 23.304043447827606 1329111 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 2.3715 1329111 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 2.7197500000000003 1329111 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 3.977099999999994 1329112 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 7.936150000000027 1329112 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 35.16617800000001 1329112 T11 oasc.OverseerTest.testPerformance op: update_state, success: 20011, failure: 0 1329114 T11 oasc.OverseerTest.printTimingStats totalTime: 327388.98 1329114 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 915.6585782350854 1329114 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 1425.935654658605 1329115 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1049.1858608633495 1329115 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 16.360450752086354 1329115 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.725 1329116 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.844 1329116 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 1.218 1329117 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 2.1403600000000043 1329117 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 12.222141000000011 1329117 T11 oasc.OverseerTest.testPerformance op: am_i_leader, success: 20013, failure: 0 1329119 T11 oasc.OverseerTest.printTimingStats totalTime: 3622.9 1329120 T11 oasc.OverseerTest.printTimingStats avgRequestsPerMinute: 915.5821178966359 1329120 T11 oasc.OverseerTest.printTimingStats 5minRateRequestsPerMinute: 1424.927585830734 1329120 T11 oasc.OverseerTest.printTimingStats 15minRateRequestsPerMinute: 1042.4347274913655 1329121 T11 oasc.OverseerTest.printTimingStats avgTimePerRequest: 0.18102733223404788 1329121 T11 oasc.OverseerTest.printTimingStats medianRequestTime: 0.179 1329121 T11 oasc.OverseerTest.printTimingStats 75thPctlRequestTime: 0.197 1329122 T11 oasc.OverseerTest.printTimingStats 95thPctlRequestTime: 0.25909999999999994 1329122 T11 oasc.OverseerTest.printTimingStats 99thPctlRequestTime: 0.5761600000000034 1329123 T11 oasc.OverseerTest.printTimingStats 999thPctlRequestTime: 17.419063000000026 I'm not going to run more tests simply because they take too much time
        Hide
        Mark Miller added a comment -

        or stateFormat = 1 has gone away

        What kind of confidence do we have for stateFormat = 2 now? It would really be nice to drop 1 from 5x rather than deal with both for a full major version again.

        Show
        Mark Miller added a comment - or stateFormat = 1 has gone away What kind of confidence do we have for stateFormat = 2 now? It would really be nice to drop 1 from 5x rather than deal with both for a full major version again.
        Hide
        Shalin Shekhar Mangar added a comment -

        What kind of confidence do we have for stateFormat = 2 now?

        I am pretty confident that it works and works well. But the client side needs a fix before 5.0 – SOLR-6521.

        It would really be nice to drop 1 from 5x rather than deal with both for a full major version again.

        I agree but how do we drop 1? I haven't thought through the migration process. It would probably need us to write the state to both places for one Solr release to auto-convert. Otherwise an upgrade would require down-time. We could definitely use stateFormat=2 as the default in 5.0 as it is already as performant as stateFormat=1 for the single collection use-case.

        Show
        Shalin Shekhar Mangar added a comment - What kind of confidence do we have for stateFormat = 2 now? I am pretty confident that it works and works well. But the client side needs a fix before 5.0 – SOLR-6521 . It would really be nice to drop 1 from 5x rather than deal with both for a full major version again. I agree but how do we drop 1? I haven't thought through the migration process. It would probably need us to write the state to both places for one Solr release to auto-convert. Otherwise an upgrade would require down-time. We could definitely use stateFormat=2 as the default in 5.0 as it is already as performant as stateFormat=1 for the single collection use-case.
        Hide
        Ramkumar Aiyengar added a comment -

        Nice, that's a pretty neat speedup

        +1 to make stateFormat=2 as the default. The idea of dropping support for stateFormat=1 makes me a bit nervous though – it should be tested in the wild for a while without users consciously opting in into it. May be schedule it for removal in 6 (thereby also giving users one full release to either get downtime or recreate indices)?

        Show
        Ramkumar Aiyengar added a comment - Nice, that's a pretty neat speedup +1 to make stateFormat=2 as the default. The idea of dropping support for stateFormat=1 makes me a bit nervous though – it should be tested in the wild for a while without users consciously opting in into it. May be schedule it for removal in 6 (thereby also giving users one full release to either get downtime or recreate indices)?
        Hide
        Shalin Shekhar Mangar added a comment -

        There is one problem in my current implementation. Earlier, the work queue was guaranteed to have only those items which had been processed but not written to ZK. This condition is not true anymore after my refactoring because the batching logic is inside ZkStateWriter and it will write to ZK without the Overseer knowing about it.

        Show
        Shalin Shekhar Mangar added a comment - There is one problem in my current implementation. Earlier, the work queue was guaranteed to have only those items which had been processed but not written to ZK. This condition is not true anymore after my refactoring because the batching logic is inside ZkStateWriter and it will write to ZK without the Overseer knowing about it.
        Hide
        Mark Miller added a comment -

        I don't think we need to support live migrations over major versions. I think this really holds back development. Major releases can break all kinds of back compat - I think the only thing we are on the hook for is a transition path - but not a live one. Live upgrades over major releases is not very realistic IMO.

        Show
        Mark Miller added a comment - I don't think we need to support live migrations over major versions. I think this really holds back development. Major releases can break all kinds of back compat - I think the only thing we are on the hook for is a transition path - but not a live one. Live upgrades over major releases is not very realistic IMO.
        Hide
        Mark Miller added a comment -

        Bq. Why not in 6?

        Supporting both over 5x will stunt development for an unknown, but probably pretty long amount of time.

        I think anyone that is nervous about it should probably wait on 4 for a few point releases. The first release of a major version jump is by definition going to have some larger changes and new bugs.

        Show
        Mark Miller added a comment - Bq. Why not in 6? Supporting both over 5x will stunt development for an unknown, but probably pretty long amount of time. I think anyone that is nervous about it should probably wait on 4 for a few point releases. The first release of a major version jump is by definition going to have some larger changes and new bugs.
        Hide
        Shalin Shekhar Mangar added a comment -
        1. Fixes the logic to add/remove items from workQueue so that the invariant is maintained
        2. Adds a ZkWriteListener interface which is used by Overseer to add/remove items from the workQueue depending on how/when state is flushed to ZK
        3. The earlier patches enabled batching on work queue processing but that is wrong because we do not have any fallback if a batch fails. So batching is disabled whenever we operate on items from the work queue.
        Show
        Shalin Shekhar Mangar added a comment - Fixes the logic to add/remove items from workQueue so that the invariant is maintained Adds a ZkWriteListener interface which is used by Overseer to add/remove items from the workQueue depending on how/when state is flushed to ZK The earlier patches enabled batching on work queue processing but that is wrong because we do not have any fallback if a batch fails. So batching is disabled whenever we operate on items from the work queue.
        Hide
        Shalin Shekhar Mangar added a comment -

        Slightly refactored. All tests pass.

        Show
        Shalin Shekhar Mangar added a comment - Slightly refactored. All tests pass.
        Hide
        Shalin Shekhar Mangar added a comment -

        Refactored a bit more. ZkWriteListener is called ZkWriteCallback instead. There was no need of a onFlushBefore and onFlushAfter methods. A single onWrite method is enough. I am not very happy with this callback API but this is the best I could come up. The other option would be to move the batching logic back to Overseer.

        Show
        Shalin Shekhar Mangar added a comment - Refactored a bit more. ZkWriteListener is called ZkWriteCallback instead. There was no need of a onFlushBefore and onFlushAfter methods. A single onWrite method is enough. I am not very happy with this callback API but this is the best I could come up. The other option would be to move the batching logic back to Overseer.
        Hide
        ASF subversion and git services added a comment -

        Commit 1646474 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1646474 ]

        SOLR-6554: Fix work queue handling

        Show
        ASF subversion and git services added a comment - Commit 1646474 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1646474 ] SOLR-6554 : Fix work queue handling
        Hide
        ASF subversion and git services added a comment -

        Commit 1646493 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1646493 ]

        SOLR-6554: Speed up overseer operations

        Show
        ASF subversion and git services added a comment - Commit 1646493 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1646493 ] SOLR-6554 : Speed up overseer operations
        Hide
        Shalin Shekhar Mangar added a comment -

        This is fixed. Thanks everyone.

        Let's discuss stateFormat=2 as default for 5.0 in it's own separate issue.

        Show
        Shalin Shekhar Mangar added a comment - This is fixed. Thanks everyone. Let's discuss stateFormat=2 as default for 5.0 in it's own separate issue.
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Shalin Shekhar Mangar
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development