Well darn... it looked like things were fixed by the upgrade to 2.3.5, but then I looked a little closer.
I happened to notice that the hit rate was much higher than intended: I designed the test so the hit rate would be closer to 50% (maxEntries = maxBlocks/2).
Setting these parameters in the test:
final int readLastBlockOdds=0; final boolean updateAnyway = false;
results in something like this:
Done! # of Elements = 200 inserts=17234 removals=17034 hits=9982766 maxObservedSize=401
So for 10M multi-threaded reads, our hit rate was 99.8%, which artificially lowers the rate at which new entries are inserted, and hence doesn't exercise the concurrent insert/evict paths as heavily, leading to a passing test most of the time.
When I modified the test to increase the write concurrency again, accounting for a cache that is apparently too big:
final int readLastBlockOdds=10; final boolean updateAnyway = true;
The removal listener issues reappear:
WARNING: Exception thrown by removal listener
java.lang.RuntimeException: listener called more than once! k=103 v=org.apache.solr.store.blockcache.BlockCacheTest$Val@49dbc210 removalCause=SIZE
at org.apache.solr.store.blockcache.BlockCacheTest$$Lambda$5/498475569.onRemoval(Unknown Source)
at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$12/1297599052.run(Unknown Source)
at org.apache.solr.store.blockcache.BlockCacheTest$$Lambda$7/957914685.execute(Unknown Source)
Guarding against the removal listener being called more than once with the same entry doesn't help either (same as before): once the duplicate invocations are suppressed, it becomes apparent that some entries are never passed to the removal listener at all.
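For reference, the guard I mean is roughly this (a sketch, not the actual test code; the class and method names are hypothetical): record each key on its first removal notification so a duplicate invocation can be detected instead of double-freeing the backing block.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the duplicate-removal guard described above.
// A concurrent set remembers which keys have already been handed to the
// removal listener, so a second notification for the same key is detectable.
class RemovalGuard<K> {
    private final Set<K> seen = ConcurrentHashMap.newKeySet();

    /** Returns true on the first removal of {@code key}, false on a duplicate call. */
    boolean firstRemoval(K key) {
        return seen.add(key);
    }
}
```

With this in place, the duplicate calls go away, but counting first removals against insertions is exactly how the missing notifications show up.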
Even if the removal listener issues are fixed, the fact that the cache can grow beyond the configured size is a problem for us. The map itself is not storing the data, only controlling access to direct memory, so timely removal (and a timely call to the removal listener) under heavy concurrency is critical. Without that, the cache will cease to function as an LRU cache under load, because we won't be able to find a free block in the direct memory to actually use.
Even with only 2 threads, I see the cache growing to at least double the configured maxEntries. Is there a way to configure the size checking to be more strict?
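For context, the shape of the configuration in question is roughly the following (a minimal sketch, not the actual BlockCache code; the size value and the releaseDirectMemoryBlock helper are hypothetical stand-ins):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

// Minimal sketch of the configuration in question (values hypothetical).
// The cache entries only gate access to direct memory, so every eviction
// must promptly reach the removal listener to free the backing block.
Cache<Long, byte[]> cache = Caffeine.newBuilder()
    .maximumSize(200)  // maxEntries; observed size still reaches roughly 2x this
    .removalListener((Long key, byte[] value, RemovalCause cause) ->
        releaseDirectMemoryBlock(key))  // hypothetical helper that frees the block
    .build();
```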