Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10205

Evaluate and reduce BlockCache store failures

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.5, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      The BlockCache is written such that requests to cache a block (BlockCache.store call) can fail, making caching less effective. We should evaluate the impact of this storage failure and potentially reduce the number of storage failures.

      The implementation reserves a single block of memory. In store, a block of memory is allocated, and then a pointer is inserted into the underling map. A block is only freed when the underlying map evicts the map entry.
      This means that when two store() operations are called concurrently (even under low load), one can fail. This is made worse by the fact that concurrent maps typically tend to amortize the cost of eviction over many keys (i.e. the actual size of the map can grow beyond the configured maximum number of entries... both the older ConcurrentLinkedHashMap and newer Caffeine do this). When this is the case, store() won't be able to find a free block of memory, even if there aren't any other concurrently operating stores.

      1. cache_performance_test.txt
        33 kB
        Yonik Seeley
      2. SOLR-10205.patch
        3 kB
        Yonik Seeley
      3. SOLR-10205.patch
        34 kB
        Yonik Seeley
      4. SOLR-10205.patch
        32 kB
        Yonik Seeley

        Activity

        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Here's a snapshot of the modifications and tests I'm using to reduce store failures and evaluate the potential impact.

        It's messy... yet another hacked up version of the HDFS performance test I used to uncover the blockcache corruption issues. I don't plan on committing most of this. It's just for evaluating what should be committed to reduce store failures.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Here's a snapshot of the modifications and tests I'm using to reduce store failures and evaluate the potential impact. It's messy... yet another hacked up version of the HDFS performance test I used to uncover the blockcache corruption issues. I don't plan on committing most of this. It's just for evaluating what should be committed to reduce store failures.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        I also plan on trying out using a higher number of reserved blocks (like 2 or 3) instead of the current 1. This helps because if 2 threads both try to cache blocks at the same time, one will grab the reserved block first, then the other will have to wait until an older entry is evicted from the map (caused by the fact that the first thread will insert a new entry).

        Show
        yseeley@gmail.com Yonik Seeley added a comment - I also plan on trying out using a higher number of reserved blocks (like 2 or 3) instead of the current 1. This helps because if 2 threads both try to cache blocks at the same time, one will grab the reserved block first, then the other will have to wait until an older entry is evicted from the map (caused by the fact that the first thread will insert a new entry).
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Here are the results of some random test runs.

        Each test run works by:

        1. Building a random index
        2. Running multiple query iterations. Each iteration looks like:
          1. create a random list of queries
          2. shuffle a list of test configurations (in this case, different BlockCache configurations)
          3. run the set of queries with multiple query threads
          4. the results (like the time for the set of queries to run) is aggregated for each configuration across multiple iterations

        Since these are random queries against a random index, results vary widely.
        The number of iterations also varies widely (because sometimes running a single set of queries against a single configuration just once takes 5-10min). The test times out after about 2 hrs.

        The configurations for these first tests varied two parameters:

        • when trying to find a free block, the number of times per loop we tried to call caffeine.cleanUp() if the map size >= maxEntries
        • whether to use an executor for map cleanup (the default) or specify Runnable::run (current thread is used for any submitted tasks)

        Some selected results:

        55.7/47.2 = 18%
        TEST=( time=532066 storeFails=384058 misses=4212598 executor=true cleanupTries=3 reserved=1) {size=1026,lookups=1216192323,hits=1166321317,evictions=39359922,storeFails=3800330,hitratio_current=0.96,lookups_persec=184208.98,hits_persec=176268.53,evictions_persec=6116.551,storeFails_persec=752.1843,time_delta=10.093270267,buffercache.allocations=0.0,buffercache.lost=0.0}
           [junit4]   2> iter=50 RESULTS:
           [junit4]   2> 	( time=557072 storeFails=1473124 misses=5151913 executor=true cleanupTries=0 reserved=1)
           [junit4]   2> 	( time=482894 storeFails=257414 misses=4090859 executor=false cleanupTries=0 reserved=1)
           [junit4]   2> 	( time=536910 storeFails=460293 misses=4275501 executor=true cleanupTries=1 reserved=1)
           [junit4]   2> 	( time=478086 storeFails=51078 misses=3934459 executor=false cleanupTries=1 reserved=1)
           [junit4]   2> 	( time=529849 storeFails=418456 misses=4238267 executor=true cleanupTries=2 reserved=1)
           [junit4]   2> 	( time=471511 storeFails=31507 misses=3918859 executor=false cleanupTries=2 reserved=1)
           [junit4]   2> 	( time=532066 storeFails=384058 misses=4212598 executor=true cleanupTries=3 reserved=1)
           [junit4]   2> 	( time=476136 storeFails=16950 misses=3900285 executor=false cleanupTries=3 reserved=1)
           [junit4]   2> 	( time=532881 storeFails=358717 misses=4188116 executor=true cleanupTries=4 reserved=1)
           [junit4]   2> 	( time=474018 storeFails=12556 misses=3893874 executor=false cleanupTries=4 reserved=1)
           [junit4]   2> 	( time=528446 storeFails=324929 misses=4162481 executor=true cleanupTries=5 reserved=1)
           [junit4]   2> 	( time=476876 storeFails=11248 misses=3901980 executor=false cleanupTries=5 reserved=1)
        
        32threads 65.7/55.8 = 17%
           [junit4]   2> ###### Testing took 53.464 seconds. Total queries=1096 Total matches=124437449
           [junit4]   2> TEST=( time=657474 storeFails=9079465 misses=20447634 executor=true cleanupTries=0 reserved=1) {size=1023,lookups=74615842
        8,hits=537166437,evictions=163933036,storeFails=21041927,hitratio_current=0.67,lookups_persec=96833.06,hits_persec=64891.32,evictions_perse
        c=15433.991,storeFails_persec=14261.533,time_delta=53.463745281,buffercache.allocations=0.0,buffercache.lost=0.0}
           [junit4]   2> iter=11 RESULTS:
           [junit4]   2>        ( time=657474 storeFails=9079465 misses=20447634 executor=true cleanupTries=0 reserved=1)
           [junit4]   2>        ( time=571034 storeFails=2082850 misses=17633145 executor=false cleanupTries=0 reserved=1)
           [junit4]   2>        ( time=617555 storeFails=2720476 misses=17615573 executor=true cleanupTries=1 reserved=1)
           [junit4]   2>        ( time=558536 storeFails=316721 misses=16978923 executor=false cleanupTries=1 reserved=1)
           [junit4]   2>        ( time=615668 storeFails=2258180 misses=17440287 executor=true cleanupTries=2 reserved=1)
           [junit4]   2>        ( time=560400 storeFails=75888 misses=16881946 executor=false cleanupTries=2 reserved=1)
           [junit4]   2>        ( time=612233 storeFails=1819890 misses=17284381 executor=true cleanupTries=3 reserved=1)
           [junit4]   2>        ( time=556029 storeFails=21173 misses=16840613 executor=false cleanupTries=3 reserved=1)
           [junit4]   2>        ( time=613037 storeFails=1476039 misses=17117779 executor=true cleanupTries=4 reserved=1)
           [junit4]   2>        ( time=556052 storeFails=10474 misses=16881301 executor=false cleanupTries=4 reserved=1)
           [junit4]   2>        ( time=611542 storeFails=1172765 misses=17019642 executor=true cleanupTries=5 reserved=1)
           [junit4]   2>        ( time=554806 storeFails=8005 misses=16850296 executor=false cleanupTries=5 reserved=1)
           [junit4]   2> ###### Starting 32 threads to test index.
        
        56.5/44.9 = 26%
           [junit4]   2> TEST=( time=448935 storeFails=11805 misses=6577890 executor=false cleanupTries=5 reserved=1) {size=1026,lookups=517573871,
        hits=434273499,evictions=66427980,storeFails=8180252,hitratio_current=0.84,lookups_persec=98465.53,hits_persec=82995.516,evictions_persec=1
        3616.733,storeFails_persec=26.554682,time_delta=110.338359367,buffercache.allocations=0.0,buffercache.lost=0.0}
           [junit4]   2> iter=3 RESULTS:
           [junit4]   2>        ( time=564653 storeFails=2763710 misses=8383855 executor=true cleanupTries=0 reserved=1)
           [junit4]   2>        ( time=476681 storeFails=703265 misses=7085675 executor=false cleanupTries=0 reserved=1)
           [junit4]   2>        ( time=538408 storeFails=1074397 misses=7120324 executor=true cleanupTries=1 reserved=1)
           [junit4]   2>        ( time=461370 storeFails=154650 misses=6712761 executor=false cleanupTries=1 reserved=1)
           [junit4]   2>        ( time=540369 storeFails=981297 misses=6973153 executor=true cleanupTries=2 reserved=1)
           [junit4]   2>        ( time=451366 storeFails=99016 misses=6643328 executor=false cleanupTries=2 reserved=1)
           [junit4]   2>        ( time=526381 storeFails=860927 misses=6900640 executor=true cleanupTries=3 reserved=1)
           [junit4]   2>        ( time=452509 storeFails=47727 misses=6630632 executor=false cleanupTries=3 reserved=1)
           [junit4]   2>        ( time=524049 storeFails=778166 misses=6851965 executor=true cleanupTries=4 reserved=1)
           [junit4]   2>        ( time=450483 storeFails=15949 misses=6601774 executor=false cleanupTries=4 reserved=1)
           [junit4]   2>        ( time=525711 storeFails=689340 misses=6809421 executor=true cleanupTries=5 reserved=1)
           [junit4]   2>        ( time=448935 storeFails=11805 misses=6577890 executor=false cleanupTries=5 reserved=1)
           [junit4]   2> ###### Starting 16 threads to test index.
        
        68.3/47.9 = 43%
           [junit4]   2> TEST=( time=479176 storeFails=26366 misses=14968859 executor=false cleanupTries=5 reserved=1) {size=1023,lookups=1080223338,hits=883635129,evictions=137110238
        ,storeFails=29268247,hitratio_current=0.83,lookups_persec=181366.17,hits_persec=149977.95,evictions_persec=26572.766,storeFails_persec=51.389427,time_delta=53.474033937,buffer
        cache.allocations=0.0,buffercache.lost=0.0}
           [junit4]   2> iter=7 RESULTS:
           [junit4]   2>        ( time=682973 storeFails=13699954 misses=25335624 executor=true cleanupTries=0 reserved=1)
           [junit4]   2>        ( time=504946 storeFails=2825071 misses=17380035 executor=false cleanupTries=0 reserved=1)
           [junit4]   2>        ( time=617386 storeFails=3449095 misses=16408244 executor=true cleanupTries=1 reserved=1)
           [junit4]   2>        ( time=491423 storeFails=418291 misses=15648428 executor=false cleanupTries=1 reserved=1)
           [junit4]   2>        ( time=608662 storeFails=2843998 misses=15869874 executor=true cleanupTries=2 reserved=1)
           [junit4]   2>        ( time=484558 storeFails=108380 misses=15318461 executor=false cleanupTries=2 reserved=1)
           [junit4]   2>        ( time=597450 storeFails=2348699 misses=15456154 executor=true cleanupTries=3 reserved=1)
           [junit4]   2>        ( time=485222 storeFails=44793 misses=15301369 executor=false cleanupTries=3 reserved=1)
           [junit4]   2>        ( time=595625 storeFails=1913073 misses=15151969 executor=true cleanupTries=4 reserved=1)
           [junit4]   2>        ( time=485603 storeFails=30201 misses=15239200 executor=false cleanupTries=4 reserved=1)
           [junit4]   2>        ( time=579955 storeFails=1560324 misses=14494648 executor=true cleanupTries=5 reserved=1)
           [junit4]   2>        ( time=479176 storeFails=26366 misses=14968859 executor=false cleanupTries=5 reserved=1)
           [junit4]   2> ###### Starting 32 threads to test index.
        
        71.1/67 = 6%
           [junit4]   2> TEST=( time=70218 storeFails=75701 misses=685973 executor=true cleanupTries=3 reserved=1) {size=1023,lookups=30157128,hits=22001616,evictions=6055373,storeFails=656528,hitratio_current=0.72,lookups_persec=30802.889,hits_persec=22328.277,evictions_persec=6049.655,storeFails_persec=1036.2858,time_delta=6.087124094,buffercache.allocations=0.0,buffercache.lost=0.0}
           [junit4]   2> iter=10 RESULTS:
           [junit4]   2> 	( time=71153 storeFails=227016 misses=755096 executor=true cleanupTries=0 reserved=1)
           [junit4]   2> 	( time=68022 storeFails=34926 misses=675982 executor=false cleanupTries=0 reserved=1)
           [junit4]   2> 	( time=69589 storeFails=88782 misses=691750 executor=true cleanupTries=1 reserved=1)
           [junit4]   2> 	( time=67931 storeFails=8711 misses=653859 executor=false cleanupTries=1 reserved=1)
           [junit4]   2> 	( time=69926 storeFails=80857 misses=689267 executor=true cleanupTries=2 reserved=1)
           [junit4]   2> 	( time=68225 storeFails=4527 misses=655974 executor=false cleanupTries=2 reserved=1)
           [junit4]   2> 	( time=70218 storeFails=75701 misses=685973 executor=true cleanupTries=3 reserved=1)
           [junit4]   2> 	( time=66837 storeFails=2137 misses=655777 executor=false cleanupTries=3 reserved=1)
           [junit4]   2> 	( time=69398 storeFails=71855 misses=693155 executor=true cleanupTries=4 reserved=1)
           [junit4]   2> 	( time=67337 storeFails=1822 misses=656747 executor=false cleanupTries=4 reserved=1)
           [junit4]   2> 	( time=69073 storeFails=58549 misses=681520 executor=true cleanupTries=5 reserved=1)
           [junit4]   2> 	( time=66978 storeFails=1645 misses=658712 executor=false cleanupTries=5 reserved=1)
           [junit4]   2> ###### Starting 32 threads to test index.
        

        I left out results with low iterations that resulted in high variability (need to fix test to try and account for this), as well as tests that resulted in a hitrate of 1 (boring).
        Thoughts:

        • not using an executor has the largest impact by far
        • trying to call map.cleanUp() at least once when there are no free entries is a win
        • the biggest performance gains tend to be when cache hit rates are relatively high (I guess when cache hit rates are low, the overall time is dominated by reading from HDFS)
        Show
        yseeley@gmail.com Yonik Seeley added a comment - Here are the results of some random test runs. Each test run works by: Building a random index Running multiple query iterations. Each iteration looks like: create a random list of queries shuffle a list of test configurations (in this case, different BlockCache configurations) run the set of queries with multiple query threads the results (like the time for the set of queries to run) is aggregated for each configuration across multiple iterations Since these are random queries against a random index, results vary widely. The number of iterations also varies widely (because sometimes running a single set of queries against a single configuration just once takes 5-10min). The test times out after about 2 hrs. The configurations for these first tests varied two parameters: when trying to find a free block, the number of times per loop we tried to call caffeine.cleanUp() if the map size >= maxEntries whether to use an executor for map cleanup (the default) or specify Runnable::run (current thread is used for any submitted tasks) Some selected results: 55.7/47.2 = 18% TEST=( time=532066 storeFails=384058 misses=4212598 executor= true cleanupTries=3 reserved=1) {size=1026,lookups=1216192323,hits=1166321317,evictions=39359922,storeFails=3800330,hitratio_current=0.96,lookups_persec=184208.98,hits_persec=176268.53,evictions_persec=6116.551,storeFails_persec=752.1843,time_delta=10.093270267,buffercache.allocations=0.0,buffercache.lost=0.0} [junit4] 2> iter=50 RESULTS: [junit4] 2> ( time=557072 storeFails=1473124 misses=5151913 executor= true cleanupTries=0 reserved=1) [junit4] 2> ( time=482894 storeFails=257414 misses=4090859 executor= false cleanupTries=0 reserved=1) [junit4] 2> ( time=536910 storeFails=460293 misses=4275501 executor= true cleanupTries=1 reserved=1) [junit4] 2> ( time=478086 storeFails=51078 misses=3934459 executor= false cleanupTries=1 reserved=1) [junit4] 2> ( time=529849 storeFails=418456 misses=4238267 executor= true cleanupTries=2 reserved=1) [junit4] 2> ( time=471511 storeFails=31507 misses=3918859 executor= false cleanupTries=2 reserved=1) [junit4] 2> ( time=532066 storeFails=384058 misses=4212598 executor= true cleanupTries=3 reserved=1) [junit4] 2> ( time=476136 storeFails=16950 misses=3900285 executor= false cleanupTries=3 reserved=1) [junit4] 2> ( time=532881 storeFails=358717 misses=4188116 executor= true cleanupTries=4 reserved=1) [junit4] 2> ( time=474018 storeFails=12556 misses=3893874 executor= false cleanupTries=4 reserved=1) [junit4] 2> ( time=528446 storeFails=324929 misses=4162481 executor= true cleanupTries=5 reserved=1) [junit4] 2> ( time=476876 storeFails=11248 misses=3901980 executor= false cleanupTries=5 reserved=1) 32threads 65.7/55.8 = 17% [junit4] 2> ###### Testing took 53.464 seconds. Total queries=1096 Total matches=124437449 [junit4] 2> TEST=( time=657474 storeFails=9079465 misses=20447634 executor= true cleanupTries=0 reserved=1) {size=1023,lookups=74615842 8,hits=537166437,evictions=163933036,storeFails=21041927,hitratio_current=0.67,lookups_persec=96833.06,hits_persec=64891.32,evictions_perse c=15433.991,storeFails_persec=14261.533,time_delta=53.463745281,buffercache.allocations=0.0,buffercache.lost=0.0} [junit4] 2> iter=11 RESULTS: [junit4] 2> ( time=657474 storeFails=9079465 misses=20447634 executor= true cleanupTries=0 reserved=1) [junit4] 2> ( time=571034 storeFails=2082850 misses=17633145 executor= false cleanupTries=0 reserved=1) [junit4] 2> ( time=617555 storeFails=2720476 misses=17615573 executor= true cleanupTries=1 reserved=1) [junit4] 2> ( time=558536 storeFails=316721 misses=16978923 executor= false cleanupTries=1 reserved=1) [junit4] 2> ( time=615668 storeFails=2258180 misses=17440287 executor= true cleanupTries=2 reserved=1) [junit4] 2> ( time=560400 storeFails=75888 misses=16881946 executor= false cleanupTries=2 reserved=1) [junit4] 2> ( time=612233 storeFails=1819890 misses=17284381 executor= true cleanupTries=3 reserved=1) [junit4] 2> ( time=556029 storeFails=21173 misses=16840613 executor= false cleanupTries=3 reserved=1) [junit4] 2> ( time=613037 storeFails=1476039 misses=17117779 executor= true cleanupTries=4 reserved=1) [junit4] 2> ( time=556052 storeFails=10474 misses=16881301 executor= false cleanupTries=4 reserved=1) [junit4] 2> ( time=611542 storeFails=1172765 misses=17019642 executor= true cleanupTries=5 reserved=1) [junit4] 2> ( time=554806 storeFails=8005 misses=16850296 executor= false cleanupTries=5 reserved=1) [junit4] 2> ###### Starting 32 threads to test index. 56.5/44.9 = 26% [junit4] 2> TEST=( time=448935 storeFails=11805 misses=6577890 executor= false cleanupTries=5 reserved=1) {size=1026,lookups=517573871, hits=434273499,evictions=66427980,storeFails=8180252,hitratio_current=0.84,lookups_persec=98465.53,hits_persec=82995.516,evictions_persec=1 3616.733,storeFails_persec=26.554682,time_delta=110.338359367,buffercache.allocations=0.0,buffercache.lost=0.0} [junit4] 2> iter=3 RESULTS: [junit4] 2> ( time=564653 storeFails=2763710 misses=8383855 executor= true cleanupTries=0 reserved=1) [junit4] 2> ( time=476681 storeFails=703265 misses=7085675 executor= false cleanupTries=0 reserved=1) [junit4] 2> ( time=538408 storeFails=1074397 misses=7120324 executor= true cleanupTries=1 reserved=1) [junit4] 2> ( time=461370 storeFails=154650 misses=6712761 executor= false cleanupTries=1 reserved=1) [junit4] 2> ( time=540369 storeFails=981297 misses=6973153 executor= true cleanupTries=2 reserved=1) [junit4] 2> ( time=451366 storeFails=99016 misses=6643328 executor= false cleanupTries=2 reserved=1) [junit4] 2> ( time=526381 storeFails=860927 misses=6900640 executor= true cleanupTries=3 reserved=1) [junit4] 2> ( time=452509 storeFails=47727 misses=6630632 executor= false cleanupTries=3 reserved=1) [junit4] 2> ( time=524049 storeFails=778166 misses=6851965 executor= true cleanupTries=4 reserved=1) [junit4] 2> ( time=450483 storeFails=15949 misses=6601774 executor= false cleanupTries=4 reserved=1) [junit4] 2> ( time=525711 storeFails=689340 misses=6809421 executor= true cleanupTries=5 reserved=1) [junit4] 2> ( time=448935 storeFails=11805 misses=6577890 executor= false cleanupTries=5 reserved=1) [junit4] 2> ###### Starting 16 threads to test index. 68.3/47.9 = 43% [junit4] 2> TEST=( time=479176 storeFails=26366 misses=14968859 executor= false cleanupTries=5 reserved=1) {size=1023,lookups=1080223338,hits=883635129,evictions=137110238 ,storeFails=29268247,hitratio_current=0.83,lookups_persec=181366.17,hits_persec=149977.95,evictions_persec=26572.766,storeFails_persec=51.389427,time_delta=53.474033937,buffer cache.allocations=0.0,buffercache.lost=0.0} [junit4] 2> iter=7 RESULTS: [junit4] 2> ( time=682973 storeFails=13699954 misses=25335624 executor= true cleanupTries=0 reserved=1) [junit4] 2> ( time=504946 storeFails=2825071 misses=17380035 executor= false cleanupTries=0 reserved=1) [junit4] 2> ( time=617386 storeFails=3449095 misses=16408244 executor= true cleanupTries=1 reserved=1) [junit4] 2> ( time=491423 storeFails=418291 misses=15648428 executor= false cleanupTries=1 reserved=1) [junit4] 2> ( time=608662 storeFails=2843998 misses=15869874 executor= true cleanupTries=2 reserved=1) [junit4] 2> ( time=484558 storeFails=108380 misses=15318461 executor= false cleanupTries=2 reserved=1) [junit4] 2> ( time=597450 storeFails=2348699 misses=15456154 executor= true cleanupTries=3 reserved=1) [junit4] 2> ( time=485222 storeFails=44793 misses=15301369 executor= false cleanupTries=3 reserved=1) [junit4] 2> ( time=595625 storeFails=1913073 misses=15151969 executor= true cleanupTries=4 reserved=1) [junit4] 2> ( time=485603 storeFails=30201 misses=15239200 executor= false cleanupTries=4 reserved=1) [junit4] 2> ( time=579955 storeFails=1560324 misses=14494648 executor= true cleanupTries=5 reserved=1) [junit4] 2> ( time=479176 storeFails=26366 misses=14968859 executor= false cleanupTries=5 reserved=1) [junit4] 2> ###### Starting 32 threads to test index. 71.1/67 = 6% [junit4] 2> TEST=( time=70218 storeFails=75701 misses=685973 executor= true cleanupTries=3 reserved=1) {size=1023,lookups=30157128,hits=22001616,evictions=6055373,storeFails=656528,hitratio_current=0.72,lookups_persec=30802.889,hits_persec=22328.277,evictions_persec=6049.655,storeFails_persec=1036.2858,time_delta=6.087124094,buffercache.allocations=0.0,buffercache.lost=0.0} [junit4] 2> iter=10 RESULTS: [junit4] 2> ( time=71153 storeFails=227016 misses=755096 executor= true cleanupTries=0 reserved=1) [junit4] 2> ( time=68022 storeFails=34926 misses=675982 executor= false cleanupTries=0 reserved=1) [junit4] 2> ( time=69589 storeFails=88782 misses=691750 executor= true cleanupTries=1 reserved=1) [junit4] 2> ( time=67931 storeFails=8711 misses=653859 executor= false cleanupTries=1 reserved=1) [junit4] 2> ( time=69926 storeFails=80857 misses=689267 executor= true cleanupTries=2 reserved=1) [junit4] 2> ( time=68225 storeFails=4527 misses=655974 executor= false cleanupTries=2 reserved=1) [junit4] 2> ( time=70218 storeFails=75701 misses=685973 executor= true cleanupTries=3 reserved=1) [junit4] 2> ( time=66837 storeFails=2137 misses=655777 executor= false cleanupTries=3 reserved=1) [junit4] 2> ( time=69398 storeFails=71855 misses=693155 executor= true cleanupTries=4 reserved=1) [junit4] 2> ( time=67337 storeFails=1822 misses=656747 executor= false cleanupTries=4 reserved=1) [junit4] 2> ( time=69073 storeFails=58549 misses=681520 executor= true cleanupTries=5 reserved=1) [junit4] 2> ( time=66978 storeFails=1645 misses=658712 executor= false cleanupTries=5 reserved=1) [junit4] 2> ###### Starting 32 threads to test index. I left out results with low iterations that resulted in high variability (need to fix test to try and account for this), as well as tests that resulted in a hitrate of 1 (boring). Thoughts: not using an executor has the largest impact by far trying to call map.cleanUp() at least once when there are no free entries is a win the biggest performance gains tend to be when cache hit rates are relatively high (I guess when cache hit rates are low, the overall time is dominated by reading from HDFS)
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        OK, here's an update to the tests I'm using. Some changes:

        • BlockCache now only tries calling cleanUp the next time through the loop (i.e. needs to fail to find a block first)
        • only call cleanUp if map size >= number_of_blocks (as opposed to max configured size of the map... this makes a difference when reserved_blocks>1)
        • test estimates number of queries to use in the next test round based on time so far
        • test exits by itself after a given amount of time (1hr is what I'm testing with)
        Show
        yseeley@gmail.com Yonik Seeley added a comment - OK, here's an update to the tests I'm using. Some changes: BlockCache now only tries calling cleanUp the next time through the loop (i.e. needs to fail to find a block first) only call cleanUp if map size >= number_of_blocks (as opposed to max configured size of the map... this makes a difference when reserved_blocks>1) test estimates number of queries to use in the next test round based on time so far test exits by itself after a given amount of time (1hr is what I'm testing with)
        Hide
        yseeley@gmail.com Yonik Seeley added a comment - - edited

        Here's the results (attached) of testing with different numbers of reserved blocks (up to 4) and different number of calls to cleanUp when the map size exceeds the number of blocks - reserved. Tests were done on systems with 16 and 32 logical (hyper-threaded) cores.

        The speedups compared to trunk range from 11% to 68% for these artificial random tests.

        Based on the results, I think the right balance is going with reserved blocks = 4 and a single call to cleanUp in the outer loop of BlockCache.findEmptyLocation()

        Show
        yseeley@gmail.com Yonik Seeley added a comment - - edited Here's the results (attached) of testing with different numbers of reserved blocks (up to 4) and different number of calls to cleanUp when the map size exceeds the number of blocks - reserved. Tests were done on systems with 16 and 32 logical (hyper-threaded) cores. The speedups compared to trunk range from 11% to 68% for these artificial random tests. Based on the results, I think the right balance is going with reserved blocks = 4 and a single call to cleanUp in the outer loop of BlockCache.findEmptyLocation()
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Final patch - I plan on committing shortly.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Final patch - I plan on committing shortly.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit f2da342c47f8588996c7a68433a4e11131e46ee2 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f2da342 ]

        SOLR-10205: BlockCache - use 4 reserved blocks, don't use executor in caffeine, call cleanUp

        Show
        jira-bot ASF subversion and git services added a comment - Commit f2da342c47f8588996c7a68433a4e11131e46ee2 in lucene-solr's branch refs/heads/branch_6x from Yonik Seeley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f2da342 ] SOLR-10205 : BlockCache - use 4 reserved blocks, don't use executor in caffeine, call cleanUp
        Hide
        ben.manes Ben Manes added a comment -

        For writes you might prefer to use an atomic computation instead of a racy get-compute-put. The stampeding writers will cause a storm of removal notifications indicating the value was replaced. I think that would result in more frequently needing to free and acquire slots in the bank. This would reduce I/O costs as well, of course. Caffeine performs this by using a lock-free lookup that falls back to a computeIfAbsent, so that a hit won't thrash on locks if the entry is present.

        Show
        ben.manes Ben Manes added a comment - For writes you might prefer to use an atomic computation instead of a racy get-compute-put. The stampeding writers will cause a storm of removal notifications indicating the value was replaced. I think that would result in more frequently needing to free and acquire slots in the bank. This would reduce I/O costs as well, of course. Caffeine performs this by using a lock-free lookup that falls back to a computeIfAbsent, so that a hit won't thrash on locks if the entry is present.

          People

          • Assignee:
            yseeley@gmail.com Yonik Seeley
            Reporter:
            yseeley@gmail.com Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development