|
[
Permlink
| « Hide
]
Noble Paul added a comment - 30/Jul/08 06:15 AM - edited
A POC implementation based on ConcurrentHashMap
If somebody can review the implementation we can add another cache implementation to Solr.
Not entirely... there are a few memory barriers that make things thread safe, so I wouldn't call it "free" (since this branched off of another issue where some had the idea that one could get away without any sort of locks or memory barriers). It's a good approach in genera, and should scale better with many CPUs under very high lookup load. But I'm not sure that it should use a separate cleaner thread... and if it does, I don't think it should be scheduled. After we got those details worked out, then we'd need a SolrCache implementation that uses it. Given where we are in the release cycle (and that the cache contention issue is only affecting 1 person that I've seen), I think this should want until after 1.3
When SOLR adds object to cache or remove it from cache it does not change it, it manipulates with internal arrays of pointers to objects (which are probably atomic, but I don't know such JVM & GC internals in-depth...) Looks heavy with TreeSet...
Thanks for the comments. I agree with you . I devised this approach because some user reported that heavy cache lookups are slowing things down for him. The cost benefit analysis is yet to be studied. Separate cleaner thread is optional in the implementation . I am yet to study the cost of sorting over a hundred thousand entries. Do you recommend that the cleaner thread just keep running forever? That is fine, so there should be a sleep for the thread? BTW is the executorservice very expensive? I did not provide a SolrCache implementation because that is going to be drown the approach in details.
True. This is not marked for 1.3. But this can definitely live as a patch and anyone who needs it would benefit from it Fuad. You cannot trivialize concurrent programming so easily. Whatever BTW. Using TreeSet is not 'heavy' . It is the right tool for right On Thu, Jul 31, 2008 at 9:58 PM, Fuad Efendi (JIRA) <jira@apache.org> wrote: – Paul,
I have never ever suggested to use 'volatile' 'to avoid synchronization' for concurrent programming. I only noticed some extremely stupid code where SOLR uses _double_synchronization and AtomicLong inside: public synchronized Object put(Object key, Object value) { if (state == State.LIVE) { stats.inserts.incrementAndGet(); } synchronized (map) { // increment local inserts regardless of state??? // it does make it more consistent with the current size... inserts++; return map.put(key,value); } } Each tool has an area of applicability, and even ConcurrentHashMap just slightly intersects with SOLR needs; SOLR does not need 'consistent view at a point in time' on cached objects. 'volatile' is part of Java Specs, and implemented differently by different vendors. I use volatile (instead of more expensive AtomicLong) only and only to prevent JVM HotSpot Optimizer from some not-applicable staff...
A simple typo I think... a remnant from way back when changing what object was being synchronized on. That's why I like explicit synchronization rather than adding it to a method signature (easier to miss). I just fixed this to be public Object put(Object key, Object value) { synchronized (map) { if (state == State.LIVE) { stats.inserts.incrementAndGet(); } // increment local inserts regardless of state??? // it does make it more consistent with the current size... inserts++; return map.put(key,value); } } Thanks Yonik, I even guess that in some cases synchronization is faster than sun.misc.Unsafe.compareAndSwapLong(this, valueOffset, expect, update);
public final long incrementAndGet() { for (;;) { long current = get(); long next = current + 1; if (compareAndSet(current, next)) return next; } }
It can be 'long' inside synchronized block... If you don't need the synchronized block, then an atomic variable for "inserts" (for example) would be a big win.
But if you have the synchronized block anyway, it's probably faster to just expand it's scope if the operations to be done are simple. my findings on a simple perf test with no contention (single thread)
The code is there in the main() cache size 1 million with Hashmap
with ConcurrenthashHashMap
other observations Full SolrCache implementation
not tested the initialization parameters are same us the current LRUCache. I have also been considering a concurrent LRU cache for my own application and seeing this isse made me think about it again. Wouldn't one option be to use a ReentrantReadWriteLock to synchronise the map rather than complete synchronisation on the map for both readers and writers. Although that does not give a free get() it would at least allow concurrent get and still be able to use the LinkedHashMap and would not require the extra thread. Not sure if SOLR is java 1.5, but if not you could still use Doug Lea's concurrent package for pre 1.5 Java.
The patch contains a implementation which uses the java 5 features (ConcurrentHashMap) .It is better than using a separate Lock
name change and some refactoring
Here is a prototype of an idea I've had for a while for an efficient concurrent LRU cache.
It is completely untested... consider it more "code as design". It should feature faster cleaning - O( n ) when it works well. In addition to low and high water marks, it adds the concept of an "acceptable" water mark. A cleaning phase will try to go to the low water mark, but will be considered successful if it hits the acceptable water mark. This is coupled with a multi-pass cleaning phase. From the comments: // if we want to keep at least 1000 entries, then timestamps of // current through current-1000 are guaranteed not to be the oldest! // Also, if we want to remove 500 entries, then // oldestEntry through oldestEntry+500 are guaranteed to be // removed. The oldestEntry and newestEntry in the set of entries currently being considered is recorded for each phase. Entries that are new enough such that they are guaranteed to be kept are immediately removed from consideration, and entries that are old enough such that they are guaranteed to be removed are immediately removed (no sorting necessary). After 2 phases of this (configurable) and we still haven't removed enough entries, a priority queue is used to find the oldest entries out of those remaining. There are undoubtedly some other tricks we can use, but this was the best I could come up with for now. Looks good. This has a lot in common with my approach. The doClean() is done far more efficiently in your implementation . I can improve mine with your cleanup code (if you think it is fine)
+1 I'd also include the manual tracking of size() that mine did... the ConcurrentHashMap.size() doesn't look fast. Another thing to think about : pass an optional Executor in the constructor instead of creating a cleaning thread... and if it's null, it means "do it in the foreground". That would add flexibility and the ability to avoid one thread per cache if desired. Paul, Yonik, thanks for your efforts; BTW 'Concurrent'HashMap uses spinloops for 'safe' updates in order to avoid synchronization (instead of giving up CPU cycles); there are always cases when it is not faster that simple HashMap with synchronization.
LingPipe uses different approach, see last comment at Also, why are you in-a-loop with LRU? LFU is logically better. +1 and thanks for sharing. My implementation just uses a Map<K,V> internally. If we can get a Map implemenatation that is faster than ConcurrentHashMap (and concurrent) we can replace it (after seeing the performance).
borrowed some ideas from yonik's impl.
Probably this is a good enough first cut. Yonik – do you think this is good enough to go in now? Probably some users can try it out and report their experiences if we commit it early. I can take this up if you want.
I'll commit this shortly. Committed revision 708656.
Thanks Fuad, Noble and Yonik! Huge thanks on this one. This was one of the bottlenecks I've seen previously. For apples to apples load tests, this more then doubled my overall throughput.
I do notice a sort of "pulsing" in the responses. It appears that everything flies along, but on occasion everything piles up for a second, then starts going again. This leads to a few response times that are over 1 second, but the average is way down closer to 20ms. Is this the cleanup involved when hitting a high-water mark? Overall, it's a huge improvement.
yeah you are right.
Very good to hear that!
Yes, a cleanup is attempted when the size crosses the high watermark ('size' config parameter). It is done in two stages. In the first stage, least recently used items are evicted. If, after the first stage, the cache size is still greater than 'acceptableSize' config parameter (default is 0.95*maxSize), the second stage takes over. The second stage is more intensive and tries to bring down the cache size to the 'minSize' config parameter (default is 0.9*maxSize). Note that the cleanup is done in the same thread which calls a put on the cache, hence the 'pulsing' that you are seeing. The cache implementation supports using a separate cleanup thread too, however it is not used currently. We still need to evaluate the best way to use it and how much it can help. I'm not sure if that helped out. I haven't run a profiler yet ,but I think the "pulsing" (for lack of a better term) is caused by something else.
Here's why... I used the new FastLRUCache only for my Document cache in my latest test. The document cache size is large enough to hold all of the documents in the test set, helping focus on cache behavior. The warming query is enough to get all documents into the cache on startup. So, the cache is essentially as full as it's gonna get. 100% hit rate, and not growing. Yet, it still exhibits this pulsing. Could it be associated with the overhead of maintaining the least recently used entries? The introduction of the background thread didn't address this. It also didn't appear to speed things up, in fact it dropped overall throughput a bit, though still better then they old LRUCache.
The overhead is ~= 0. It just has to increment an AtomicLong everytime you do a get() .I suspect the 'pulsing' may be because of GC pauses. enable GC logging and you will know Todd, what's the used size of your Document cache?
I'll review this latest incarnation of ConcurrentLRUCache to see if there's anything that might cause this. Some minor updates:
Todd: have you verified that the hit rate stays at 100% (no new inserts, no new evictions, etc)? If so, it might just be GC as Noble suggests. Bigger heaps often increase GC pauses.
I'll hook up profiling and see if my GC overhead has changed, or is seeing big peaks. Should be later today.
My document cache is only 4000 for that test, but only about 3300 documents are in it. I wanted to focus on the overhead of getting things out of LRUCache, as in production we get >97% hit rates. I verified it's at 100% hit rate (warming fills it). Committed revision 709188.
Thanks Yonik! The markAndSweep logic in the current code didn't replicate the logic I gave in my example ConcurrentLRUCache, and it can remove a lot more than it should.
Specifically, my algorithm broke the results into 3 groups:
The current code reversed this logic, assuming that one can remove everything that is not in the top group. This isn't valid though, as lastAccess isn't uniform (and thus the size of the top group could be as small as 1). I ran with a profiler and I'm not seeing any bursts of garbage collection. It's at a steady ~2%, with no major collections occurring (which is great!). However, the use of the profiler also slows things down about 10-20 % which seems to be enough that the pulsing goes away. I believe the pulsing may be some sort of limit of the testing I'm doing on a limited local environment. I'll include the patch in my current Solr app and do a more accurate comparison with our production level load tests to see if it still exists. Though it may be a while before I can provide feedback from that one, as getting machines allocated for the heavy load testing can take a bit.
As I said before, this is a huge improvement. Thanks for all the work on this one. I'll take a crack at updating the patch such that too many entries aren't removed.
I missed the point . I guess this post made it clear . As you mentioned it should be a 3 step cleanup I forgot to add license headers to the three source files for this issue. I'll hold off adding them lest it breaks patches that you guys are working on.
yonik's suggestions implemented
creating an array[map.size()] for every markAndSweep() is expensive. Iterator should be better
Thanks Noble, we had a mid-air implementation collision
I'm doing some quick performance testing the version I wrote now...I'll try it against your version after and then we can go from there. OK, here's the implementation based on my previous pseudo code, along with a very quick performance test.
The test uses random keys over a slightly bigger range than the table. It also uses an upper water mark 10% higher than the lower water mark, and an acceptable water mark half way inbetween. I haven't experimented to see what the best acceptable water mark is for either impl. If anyone wants to do a more realistic test with real queries, they are welcome to it. I did 4 runs and took the lowest number for each sub-test. Java6 -server, WinXP, P4. Times in milliseconds. doPerfTest(2000000, 100000, 200000); // big cache noble=17063 yonik=9157 doPerfTest(2000000, 100000, 120000); // smaller key space increases distance between oldest, newest and makes the first passes less effective. noble=8391 yonik=5812 doPerfTest(6000000, 1000, 2000); // small cache, smaller hit rate noble=17578 yonik=12515 doPerfTest(6000000, 1000, 1200); // small cache, bigger hit rate noble=11500 yonik=8219 Nobe, your latest patch contains code like this:
if (!markAndSweepLock.tryLock()) return; long oldestItem = this.oldestItem; [...] markAndSweepLock.unlock(); this.oldestItem = oldestItem; oldestItem isn't volatile (and doesn't need to be if accessed correctly). Added some minor changes, making sure that minLimit >= 1 and limit >minLimit (needed for rounding with small cache sizes).
Also added test code for LRUCache vs FastLRUCache. It appears that LRUCache is faster (at least on my single proc PC) when the hit ratio is low, and FastLRUCache is faster when the hit ratio is high. Should FastLRUCache be made the default in the example schema for the filterCache? time=2937 impl=LRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=2266 impl=FastLRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=3594 impl=LRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816075 time=1484 impl=FastLRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=3203 impl=LRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835225 time=4593 impl=FastLRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.751506 time=3781 impl=LRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.834685 time=2656 impl=FastLRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8232835000000001 time=3234 impl=LRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.523398 time=5047 impl=FastLRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.3831675 time=4125 impl=LRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.511871 time=3969 impl=FastLRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.6665975 time=3390 impl=LRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.1445725 time=5687 impl=FastLRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10041049999999996 time=4750 impl=LRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10340150000000004 time=6875 impl=FastLRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.22233749999999997 time=1343 impl=LRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=860 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=1547 impl=LRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=703 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=1610 impl=LRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.833648 time=2406 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.7404839999999999 time=2078 impl=LRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8334255 time=859 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.998974 time=1922 impl=LRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5003285 time=2875 impl=FastLRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.3516785 time=2422 impl=LRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5002055000000001 time=1203 impl=FastLRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.821195 time=2297 impl=LRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10054949999999996 time=2969 impl=FastLRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.05416350000000003 time=3078 impl=LRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10003499999999999 time=3000 impl=FastLRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10475299999999999 Latest patch:
I'll commit soon if there are no objections. CacheEntry[] eset = new CacheEntry[sz]; int eSize = 0; Isn't it too expensive to create a potentially huge array every time we do a clean? (too much work for GC) .May be we do not even need it if the first loop is enough. Moreover this one thing has added more code. I didn't use lucene PriorityQueue because if somebody wished to lift the code they can easily do so if there is no other dependency. When I posted the requirement in google collections there was a lot of interest in such a component. Can TreeSet do the trick?
That's what benchmarking is for It's a single short-lived allocation that allows us to greatly reduce the number of elements we need to evaluate on successive passes. Inserts into a TreeSet may have a higher GC cost given it's an allocation per insert.
Right... although in my testing, it seemed like the first loop was rarely sufficient (although the second often was). Here's the performance test from the patch on a more recent machine – Intel Quad Core, RHEL 64-bit, Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_11-b03, mixed mode):
time=1456 impl=LRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=1041 impl=FastLRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=3256 impl=LRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=754 impl=FastLRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=1234 impl=LRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835225 time=1564 impl=FastLRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.751506 time=3728 impl=LRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835006 time=1384 impl=FastLRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.798109 time=1357 impl=LRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.523398 time=1894 impl=FastLRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.3831675 time=4556 impl=LRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.512785 time=1514 impl=FastLRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.4682115 time=1614 impl=LRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.1445725 time=1837 impl=FastLRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10041049999999996 time=4710 impl=LRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10963999999999996 time=1816 impl=FastLRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.11144399999999999 time=339 impl=LRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=292 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=2511 impl=LRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=351 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=383 impl=LRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.833648 time=580 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.7404839999999999 time=2716 impl=LRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8337875 time=805 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.79799 time=570 impl=LRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5003285 time=794 impl=FastLRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.3516785 time=3676 impl=LRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.49959549999999997 time=1685 impl=FastLRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.436728 time=712 impl=LRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10054949999999996 time=1022 impl=FastLRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.05416350000000003 time=4395 impl=LRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.100526 time=2562 impl=FastLRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.08556600000000003 With more number of threads this time (4 and 16): time=1794 impl=LRUCache nThreads= 4 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608 time=594 impl=FastLRUCache nThreads= 4 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816075 time=1737 impl=LRUCache nThreads= 16 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981607 time=602 impl=FastLRUCache nThreads= 16 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981602 time=2387 impl=LRUCache nThreads= 4 size=100000 maxKey=120000 gets=2000000 hitRatio=0.830956 time=866 impl=FastLRUCache nThreads= 4 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8892465 time=1793 impl=LRUCache nThreads= 16 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8274485 time=706 impl=FastLRUCache nThreads= 16 size=100000 maxKey=120000 gets=2000000 hitRatio=0.9586865 time=2233 impl=LRUCache nThreads= 4 size=100000 maxKey=200000 gets=2000000 hitRatio=0.5025255 time=1228 impl=FastLRUCache nThreads= 4 size=100000 maxKey=200000 gets=2000000 hitRatio=0.654153 time=1905 impl=LRUCache nThreads= 16 size=100000 maxKey=200000 gets=2000000 hitRatio=0.500583 time=883 impl=FastLRUCache nThreads= 16 size=100000 maxKey=200000 gets=2000000 hitRatio=0.9067965 time=5336 impl=LRUCache nThreads= 4 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10182199999999997 time=1780 impl=FastLRUCache nThreads= 4 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.25870800000000005 time=2911 impl=LRUCache nThreads= 16 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10132300000000005 time=1941 impl=FastLRUCache nThreads= 16 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.508488 time=687 impl=LRUCache nThreads= 4 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=421 impl=FastLRUCache nThreads= 4 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=782 impl=LRUCache nThreads= 16 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=452 impl=FastLRUCache nThreads= 16 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=813 impl=LRUCache nThreads= 4 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8333735 time=678 impl=FastLRUCache nThreads= 4 size=1000 maxKey=1200 gets=2000000 hitRatio=0.9988885 time=794 impl=LRUCache nThreads= 16 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8331635 time=503 impl=FastLRUCache nThreads= 16 size=1000 maxKey=1200 gets=2000000 hitRatio=0.977526 time=1554 impl=LRUCache nThreads= 4 size=1000 maxKey=2000 gets=2000000 hitRatio=0.500093 time=928 impl=FastLRUCache nThreads= 4 size=1000 maxKey=2000 gets=2000000 hitRatio=0.802332 time=1102 impl=LRUCache nThreads= 16 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5002759999999999 time=566 impl=FastLRUCache nThreads= 16 size=1000 maxKey=2000 gets=2000000 hitRatio=0.954131 time=1543 impl=LRUCache nThreads= 4 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10062899999999997 time=1039 impl=FastLRUCache nThreads= 4 size=1000 maxKey=10000 gets=2000000 hitRatio=0.7582409999999999 time=1372 impl=LRUCache nThreads= 16 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10031000000000001 time=604 impl=FastLRUCache nThreads= 16 size=1000 maxKey=10000 gets=2000000 hitRatio=0.935282 Now with 8 and 32 threads: time=2109 impl=LRUCache nThreads= 8 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816075 time=608 impl=FastLRUCache nThreads= 8 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981606 time=1502 impl=LRUCache nThreads= 32 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816045 time=648 impl=FastLRUCache nThreads= 32 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981592 time=3876 impl=LRUCache nThreads= 8 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8267995 time=748 impl=FastLRUCache nThreads= 8 size=100000 maxKey=120000 gets=2000000 hitRatio=0.915961 time=2176 impl=LRUCache nThreads= 32 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8271935 time=694 impl=FastLRUCache nThreads= 32 size=100000 maxKey=120000 gets=2000000 hitRatio=0.9652565 time=2038 impl=LRUCache nThreads= 8 size=100000 maxKey=200000 gets=2000000 hitRatio=0.5005305 time=1088 impl=FastLRUCache nThreads= 8 size=100000 maxKey=200000 gets=2000000 hitRatio=0.789179 time=2147 impl=LRUCache nThreads= 32 size=100000 maxKey=200000 gets=2000000 hitRatio=0.4997505 time=884 impl=FastLRUCache nThreads= 32 size=100000 maxKey=200000 gets=2000000 hitRatio=0.926915 time=2343 impl=LRUCache nThreads= 8 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10397699999999999 time=2207 impl=FastLRUCache nThreads= 8 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.34063 time=3440 impl=LRUCache nThreads= 32 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10123850000000001 time=2087 impl=FastLRUCache nThreads= 32 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.5367375 time=909 impl=LRUCache nThreads= 8 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=443 impl=FastLRUCache nThreads= 8 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=682 impl=LRUCache nThreads= 32 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=447 impl=FastLRUCache nThreads= 32 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065 time=1189 impl=LRUCache nThreads= 8 size=1000 maxKey=1200 gets=2000000 hitRatio=0.832726 time=605 impl=FastLRUCache nThreads= 8 size=1000 maxKey=1200 gets=2000000 hitRatio=0.919104 time=1463 impl=LRUCache nThreads= 32 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8337005 time=489 impl=FastLRUCache nThreads= 32 size=1000 maxKey=1200 gets=2000000 hitRatio=0.9845845 time=1256 impl=LRUCache nThreads= 8 size=1000 maxKey=2000 gets=2000000 hitRatio=0.500149 time=678 impl=FastLRUCache nThreads= 8 size=1000 maxKey=2000 gets=2000000 hitRatio=0.907774 time=1013 impl=LRUCache nThreads= 32 size=1000 maxKey=2000 gets=2000000 hitRatio=0.49962399999999996 time=503 impl=FastLRUCache nThreads= 32 size=1000 maxKey=2000 gets=2000000 hitRatio=0.976796 time=1504 impl=LRUCache nThreads= 8 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10030550000000005 time=754 impl=FastLRUCache nThreads= 8 size=1000 maxKey=10000 gets=2000000 hitRatio=0.9151345 time=1245 impl=LRUCache nThreads= 32 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10028899999999996 time=499 impl=FastLRUCache nThreads= 32 size=1000 maxKey=10000 gets=2000000 hitRatio=0.978823 When the number of threads are increased, FastLRUCache is true to its name That benchmark really isn't valid for a high number of threads though: notice the difference in hitRatio.
If you have many threads quickly adding items and only one thread at a time removing items, the FastLRUCache goes over it's target size and thus increases it's hitRatio, making it artificially faster. This isn't a concern for it's use in Solr though, since the generation of a cache value will be much slower than clearing the cache. There is another number which we have ignored. If the cleanup is done in a separate thread , FastLRUCache consistently outperforms the legacy one. (shalin forgot to put the numbers). For a very large cache size , the cleanup takes ~200-300 ms. Which means a request can end up paying a huge price.
We need to add a new 'newCleanupThread' option to FastLRUCache (it is there in my old patch). I guess with that we can make FastLRUcache the default with newCleanupThread=true.
If you apply yonik's latest patch on trunk you get two extra files . You can straightaway copy those two files to 1.3 and use it. I ran it again with a new thread for each cleanup. Time taken for markAndSweep is printed:
time=1550 impl=LRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835225 MarkAndSweepTime = 138 MarkAndSweepTime = 35 MarkAndSweepTime = 36 MarkAndSweepTime = 39 MarkAndSweepTime = 41 MarkAndSweepTime = 42 MarkAndSweepTime = 42 MarkAndSweepTime = 44 MarkAndSweepTime = 43 MarkAndSweepTime = 43 MarkAndSweepTime = 44 MarkAndSweepTime = 43 MarkAndSweepTime = 43 MarkAndSweepTime = 44 MarkAndSweepTime = 43 MarkAndSweepTime = 44 MarkAndSweepTime = 44 MarkAndSweepTime = 43 time=1378 impl=FastLRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8130459999999999 time=3942 impl=LRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835045 MarkAndSweepTime = 58 MarkAndSweepTime = 165 MarkAndSweepTime = 32 MarkAndSweepTime = 34 MarkAndSweepTime = 37 MarkAndSweepTime = 37 MarkAndSweepTime = 46 MarkAndSweepTime = 40 MarkAndSweepTime = 61 MarkAndSweepTime = 53 MarkAndSweepTime = 51 MarkAndSweepTime = 44 MarkAndSweepTime = 47 MarkAndSweepTime = 47 MarkAndSweepTime = 48 MarkAndSweepTime = 48 MarkAndSweepTime = 46 time=1062 impl=FastLRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8560415 I just committed a fix to the setting of acceptableSize - it was always being set to maxSize, which would normally cause the cleaning routine to return after the first phase (and thus be called more often than normal).
added a new boolean attribute newCleanThread . Default is set to false
Yonik, what do you think about using a new cleanup thread?
The ability to use a separate cleanup thread is interesting... but I'm not sure that having the ability to spin off a new thread for each cleanup is something one would ever want to do. The cleanup thread logic should probably be fixed too (no sleeping and polling... it should wait until notified that a cleanup is needed)
OK that is a good idea. But this is an important functinality.
Yonik, not trying to be pushy but can this patch be committed?
I want to create a build for internal use out of trunk with this feature. Does this have a thread leak? Where is FastLRUCache.destroy() ever called?
It's called from the finalizer (yuck), but that finalizer will never be called because the cleaning thread references the cache (the definition of liveness). Issues with having the cache deal with the thread lifecycle is why I previously recommended exploring the use of an Executor that the user passes in. Yonik, Nice catch . There was a thread leak.
I hope this patch fixes that. The cleanup thread now holds a WeakReference to the cache Thanks Noble, looks like that solution should work.
Funny thing with the latest patch though - I get compile errors with "ant test" from the command line: compile-common:
[mkdir] Created dir: f:\code\solr\build\common
[javac] Compiling 39 source files to f:\code\solr\build\common
[javac] f:\code\solr\src\java\org\apache\solr\common\util\ConcurrentLRUCache.java:201: generic array creation
[javac] CacheEntry[] eset = new CacheEntry[sz];
[javac] ^
[javac] f:\code\solr\src\java\org\apache\solr\common\util\ConcurrentLRUCache.java:379: non-static class org.apache.solr.common.util.ConcurrentLRUCache.Cache
Entry cannot be referenced from a static context
[javac] return ((CacheEntry)b).lastAccessedCopy < ((CacheEntry)a).lastAccessedCopy;
[...]
It looks like the compiler thinks that the method is static. IntelliJ doesn't flag any errors, and I can't see anything wrong after a quick glance at the code. Does "ant test" from the command line work for you? OK, committed latest patch after some minor logic changes (including changing CacheEntry from an inner class to a static inner class, which solved the compilation errors).
I think the default should remain cleanupThread=false. It's simpler behavior, and for normal cache sizes, the pause an individual request may see is less than what one would see from a GC pause.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||