Issue Details (XML | Word | Printable)

Key: SOLR-667
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Yonik Seeley
Reporter: Noble Paul
Votes: 2
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
Solr

Alternate LRUCache implementation

Created: 30/Jul/08 06:11 AM   Updated: 13/Aug/09 02:10 PM
Return to search
Component/s: search
Affects Version/s: 1.3
Fix Version/s: 1.4

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File Licensed for inclusion in ASF works ConcurrentLRUCache.java 2008-09-26 09:28 PM Yonik Seeley 9 kB
Java Source File Licensed for inclusion in ASF works ConcurrentLRUCache.java 2008-08-01 05:21 AM Noble Paul 4 kB
Java Source File Licensed for inclusion in ASF works ConcurrentLRUCache.java 2008-07-30 06:15 AM Noble Paul 4 kB
Text File Licensed for inclusion in ASF works SOLR-667-alternate.patch 2008-11-01 07:20 PM Yonik Seeley 21 kB
Text File Licensed for inclusion in ASF works SOLR-667-alternate.patch 2008-10-31 06:04 PM Yonik Seeley 16 kB
Text File Licensed for inclusion in ASF works SOLR-667-updates.patch 2008-10-30 02:27 PM Yonik Seeley 3 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-11-23 04:22 PM Noble Paul 9 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-11-19 06:26 AM Noble Paul 7 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-11-17 04:17 AM Noble Paul 10 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-11-09 04:23 PM Noble Paul 1 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-10-31 08:15 AM Noble Paul 6 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-10-31 06:13 AM Noble Paul 6 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-10-29 05:54 PM Noble Paul 1 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-10-28 12:11 PM Shalin Shekhar Mangar 19 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-10-06 07:29 AM Noble Paul 17 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-09-25 04:39 AM Noble Paul 17 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-08-19 12:42 PM Noble Paul 16 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-08-12 02:15 PM Noble Paul 13 kB
Text File Licensed for inclusion in ASF works SOLR-667.patch 2008-08-12 02:13 PM Noble Paul 13 kB

Resolution Date: 28/Oct/08 08:14 PM


 Description  « Hide
The only available SolrCache i.e LRUCache is based on LinkedHashMap which has get() also synchronized. This can cause severe bottlenecks for faceted search. Any alternate implementation which can be faster/better must be considered.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Noble Paul added a comment - 30/Jul/08 06:15 AM - edited
A POC implementation based on ConcurrentHashMap
  • Gets are free
  • Puts are free till it touches the high water mark . It is expensive (very expensive) after the high watermark .
  • To lighten the load on put an extra thread is employed to do a concurrent mark and sweep
  • there is a high-water-mark and a low-water-mark . The extra thread cleans anything if low-water-mark is crossed. Put must removes if level crosses high-water-mark

Noble Paul added a comment - 31/Jul/08 02:56 PM
If somebody can review the implementation we can add another cache implementation to Solr.

Yonik Seeley added a comment - 31/Jul/08 03:17 PM

Gets are free

Not entirely... there are a few memory barriers that make things thread safe, so I wouldn't call it "free" (since this branched off of another issue where some had the idea that one could get away without any sort of locks or memory barriers).

It's a good approach in genera, and should scale better with many CPUs under very high lookup load. But I'm not sure that it should use a separate cleaner thread... and if it does, I don't think it should be scheduled.

After we got those details worked out, then we'd need a SolrCache implementation that uses it. Given where we are in the release cycle (and that the cache contention issue is only affecting 1 person that I've seen), I think this should want until after 1.3


Fuad Efendi added a comment - 31/Jul/08 04:27 PM

...safety, where nothing bad ever happens to an object.

When SOLR adds object to cache or remove it from cache it does not change it, it manipulates with internal arrays of pointers to objects (which are probably atomic, but I don't know such JVM & GC internals in-depth...)

Looks heavy with TreeSet...


Noble Paul added a comment - 31/Jul/08 05:35 PM

It's a good approach in genera, and should scale better with many CPUs under very high lookup load. But I'm not sure that it should use a separate cleaner thread... and if it does, I don't think it should be scheduled.

Thanks for the comments. I agree with you . I devised this approach because some user reported that heavy cache lookups are slowing things down for him. The cost benefit analysis is yet to be studied. Separate cleaner thread is optional in the implementation . I am yet to study the cost of sorting over a hundred thousand entries. Do you recommend that the cleaner thread just keep running forever? That is fine, so there should be a sleep for the thread?

BTW is the executorservice very expensive?

I did not provide a SolrCache implementation because that is going to be drown the approach in details.

I think this should want until after 1.3

True. This is not marked for 1.3. But this can definitely live as a patch and anyone who needs it would benefit from it


Noble Paul added a comment - 31/Jul/08 05:46 PM

Fuad. You cannot trivialize concurrent programming so easily. Whatever
we have commented are from our experience (and wisdom) . There is a
price to pay it. Java could have easily eliminated the
java.util.concurrent package by using 'volatile' everywhere and no
need of AtomicInteger,AtomimcLong etc. So they are there for a reason

BTW. Using TreeSet is not 'heavy' . It is the right tool for right
purpose. If you need a sorted set that is best

On Thu, Jul 31, 2008 at 9:58 PM, Fuad Efendi (JIRA) <jira@apache.org> wrote:


--Noble Paul


Fuad Efendi added a comment - 31/Jul/08 06:40 PM
Paul,

I have never ever suggested to use 'volatile' 'to avoid synchronization' for concurrent programming. I only noticed some extremely stupid code where SOLR uses _double_synchronization and AtomicLong inside:

public synchronized Object put(Object key, Object value) {
    if (state == State.LIVE) {
      stats.inserts.incrementAndGet();
    }

    synchronized (map) {
      // increment local inserts regardless of state???
      // it does make it more consistent with the current size...
      inserts++;
      return map.put(key,value);
    }
  }

Each tool has an area of applicability, and even ConcurrentHashMap just slightly intersects with SOLR needs; SOLR does not need 'consistent view at a point in time' on cached objects.

'volatile' is part of Java Specs, and implemented differently by different vendors. I use volatile (instead of more expensive AtomicLong) only and only to prevent JVM HotSpot Optimizer from some not-applicable staff...


Yonik Seeley added a comment - 31/Jul/08 07:03 PM

I only noticed some extremely stupid code where SOLR uses _double_synchronization and AtomicLong inside:

A simple typo I think... a remnant from way back when changing what object was being synchronized on. That's why I like explicit synchronization rather than adding it to a method signature (easier to miss). I just fixed this to be

public Object put(Object key, Object value) {
    synchronized (map) {
      if (state == State.LIVE) {
        stats.inserts.incrementAndGet();
      }

      // increment local inserts regardless of state???
      // it does make it more consistent with the current size...
      inserts++;
      return map.put(key,value);
    }
  }

Fuad Efendi added a comment - 31/Jul/08 07:28 PM
Thanks Yonik, I even guess that in some cases synchronization is faster than sun.misc.Unsafe.compareAndSwapLong(this, valueOffset, expect, update);
public final long incrementAndGet() {
        for (;;) {
            long current = get();
            long next = current + 1;
            if (compareAndSet(current, next))
                return next;
        }
    }
  • extremal level of safety with some level of concurrency... Do we need exact value for 'stats.inserts' (if it is not synchronized)?

It can be 'long' inside synchronized block...


Yonik Seeley added a comment - 31/Jul/08 07:36 PM
If you don't need the synchronized block, then an atomic variable for "inserts" (for example) would be a big win.
But if you have the synchronized block anyway, it's probably faster to just expand it's scope if the operations to be done are simple.

Noble Paul added a comment - 01/Aug/08 05:21 AM
my findings on a simple perf test with no contention (single thread)

The code is there in the main()

cache size 1 million

with Hashmap

  • time taken for 1 million inserts = 2019ms
  • time taken for 1 million gets = 625
  • time taken for cleanup = 345ms

with ConcurrenthashHashMap

  • time taken for 1 million inserts = 2437(roughly 20% slower than hashmap but small in absolute numbers)
  • time taken for 1 million gets = 393ms (actually faster than simple HashMap )
  • time taken for cleanup = 298ms (actually faster)

other observations
The extra thread may not be be necessary . The unlucky put() may take around .25 secs to .5secs for a cache size of 1 million .
If we keep the value of (highHaterMark -lowWaterMark) value very high cleanups will be infrequent


Noble Paul added a comment - 12/Aug/08 11:30 AM
Full SolrCache implementation
not tested

the initialization parameters are same us the current LRUCache.


Noble Paul added a comment - 12/Aug/08 02:13 PM
this is the right patch

Noble Paul added a comment - 19/Aug/08 12:42 PM
with testcase

Antony Bowesman added a comment - 02/Sep/08 07:05 AM
I have also been considering a concurrent LRU cache for my own application and seeing this isse made me think about it again. Wouldn't one option be to use a ReentrantReadWriteLock to synchronise the map rather than complete synchronisation on the map for both readers and writers. Although that does not give a free get() it would at least allow concurrent get and still be able to use the LinkedHashMap and would not require the extra thread. Not sure if SOLR is java 1.5, but if not you could still use Doug Lea's concurrent package for pre 1.5 Java.

Noble Paul added a comment - 02/Sep/08 08:00 AM - edited
The patch contains a implementation which uses the java 5 features (ConcurrentHashMap) .It is better than using a separate Lock

Noble Paul added a comment - 25/Sep/08 04:39 AM
name change and some refactoring

Yonik Seeley added a comment - 26/Sep/08 09:28 PM
Here is a prototype of an idea I've had for a while for an efficient concurrent LRU cache.
It is completely untested... consider it more "code as design". It should feature faster cleaning - O( n ) when it works well.

In addition to low and high water marks, it adds the concept of an "acceptable" water mark. A cleaning phase will try to go to the low water mark, but will be considered successful if it hits the acceptable water mark.

This is coupled with a multi-pass cleaning phase. From the comments:

// if we want to keep at least 1000 entries, then timestamps of
    // current through current-1000 are guaranteed not to be the oldest!
    // Also, if we want to remove 500 entries, then
    // oldestEntry through oldestEntry+500 are guaranteed to be
    // removed.

The oldestEntry and newestEntry in the set of entries currently being considered is recorded for each phase. Entries that are new enough such that they are guaranteed to be kept are immediately removed from consideration, and entries that are old enough such that they are guaranteed to be removed are immediately removed (no sorting necessary). After 2 phases of this (configurable) and we still haven't removed enough entries, a priority queue is used to find the oldest entries out of those remaining.

There are undoubtedly some other tricks we can use, but this was the best I could come up with for now.


Noble Paul added a comment - 27/Sep/08 06:50 AM
Looks good. This has a lot in common with my approach. The doClean() is done far more efficiently in your implementation . I can improve mine with your cleanup code (if you think it is fine)

Yonik Seeley added a comment - 27/Sep/08 03:17 PM

I can improve mine with your cleanup code (if you think it is fine)

+1

I'd also include the manual tracking of size() that mine did... the ConcurrentHashMap.size() doesn't look fast.

Another thing to think about : pass an optional Executor in the constructor instead of creating a cleaning thread... and if it's null, it means "do it in the foreground". That would add flexibility and the ability to avoid one thread per cache if desired.


Fuad Efendi added a comment - 28/Sep/08 01:36 PM
Paul, Yonik, thanks for your efforts; BTW 'Concurrent'HashMap uses spinloops for 'safe' updates in order to avoid synchronization (instead of giving up CPU cycles); there are always cases when it is not faster that simple HashMap with synchronization.

LingPipe uses different approach, see last comment at SOLR-665.

Also, why are you in-a-loop with LRU? LFU is logically better.

+1 and thanks for sharing.


Noble Paul added a comment - 28/Sep/08 03:06 PM
My implementation just uses a Map<K,V> internally. If we can get a Map implemenatation that is faster than ConcurrentHashMap (and concurrent) we can replace it (after seeing the performance).

Noble Paul added a comment - 06/Oct/08 07:29 AM
borrowed some ideas from yonik's impl.
Probably this is a good enough first cut.

Shalin Shekhar Mangar added a comment - 15/Oct/08 10:32 AM
Yonik – do you think this is good enough to go in now? Probably some users can try it out and report their experiences if we commit it early. I can take this up if you want.

Shalin Shekhar Mangar added a comment - 28/Oct/08 12:11 PM
  1. Added comments in the code
  2. Fixed a few concurrency issues

I'll commit this shortly.


Shalin Shekhar Mangar added a comment - 28/Oct/08 08:14 PM
Committed revision 708656.

Thanks Fuad, Noble and Yonik!


Todd Feak added a comment - 29/Oct/08 03:45 PM
Huge thanks on this one. This was one of the bottlenecks I've seen previously. For apples to apples load tests, this more then doubled my overall throughput.

I do notice a sort of "pulsing" in the responses. It appears that everything flies along, but on occasion everything piles up for a second, then starts going again. This leads to a few response times that are over 1 second, but the average is way down closer to 20ms. Is this the cleanup involved when hitting a high-water mark?

Overall, it's a huge improvement.


Noble Paul added a comment - 29/Oct/08 05:31 PM

This leads to a few response times that are over 1 second, but the average is way down closer to 20ms.

yeah you are right.
there is one more feature in ConcurrentLRUCache whcih enables cleanups to be done in a new thread .FastLRUCache is not using it yet .I'll give a patch soon . This will take care of that 1 sec delay.


Shalin Shekhar Mangar added a comment - 29/Oct/08 05:31 PM

For apples to apples load tests, this more then doubled my overall throughput.

Very good to hear that!

Is this the cleanup involved when hitting a high-water mark?

Yes, a cleanup is attempted when the size crosses the high watermark ('size' config parameter). It is done in two stages. In the first stage, least recently used items are evicted. If, after the first stage, the cache size is still greater than 'acceptableSize' config parameter (default is 0.95*maxSize), the second stage takes over. The second stage is more intensive and tries to bring down the cache size to the 'minSize' config parameter (default is 0.9*maxSize).

Note that the cleanup is done in the same thread which calls a put on the cache, hence the 'pulsing' that you are seeing. The cache implementation supports using a separate cleanup thread too, however it is not used currently. We still need to evaluate the best way to use it and how much it can help.


Noble Paul added a comment - 29/Oct/08 05:54 PM
run cleanup in new thread

Todd Feak added a comment - 29/Oct/08 06:58 PM
I'm not sure if that helped out. I haven't run a profiler yet ,but I think the "pulsing" (for lack of a better term) is caused by something else.

Here's why... I used the new FastLRUCache only for my Document cache in my latest test. The document cache size is large enough to hold all of the documents in the test set, helping focus on cache behavior. The warming query is enough to get all documents into the cache on startup. So, the cache is essentially as full as it's gonna get. 100% hit rate, and not growing. Yet, it still exhibits this pulsing. Could it be associated with the overhead of maintaining the least recently used entries?

The introduction of the background thread didn't address this. It also didn't appear to speed things up, in fact it dropped overall throughput a bit, though still better then they old LRUCache.


Noble Paul added a comment - 30/Oct/08 03:49 AM

Could it be associated with the overhead of maintaining the least recently used entries?

The overhead is ~= 0. It just has to increment an AtomicLong everytime you do a get() .I suspect the 'pulsing' may be because of GC pauses. enable GC logging and you will know


Yonik Seeley added a comment - 30/Oct/08 02:01 PM
Todd, what's the used size of your Document cache?
I'll review this latest incarnation of ConcurrentLRUCache to see if there's anything that might cause this.

Yonik Seeley added a comment - 30/Oct/08 02:27 PM
Some minor updates:
  • fix thread saftey issue in finalizer (background thread may never see stop being set)
  • fix tracking of size in put() to only increment if oldValue != null
  • optimize cleaning check in put() since it will be called for every put until the size is back down.

Yonik Seeley added a comment - 30/Oct/08 03:00 PM
Todd: have you verified that the hit rate stays at 100% (no new inserts, no new evictions, etc)? If so, it might just be GC as Noble suggests. Bigger heaps often increase GC pauses.

Todd Feak added a comment - 30/Oct/08 03:10 PM
I'll hook up profiling and see if my GC overhead has changed, or is seeing big peaks. Should be later today.

My document cache is only 4000 for that test, but only about 3300 documents are in it. I wanted to focus on the overhead of getting things out of LRUCache, as in production we get >97% hit rates. I verified it's at 100% hit rate (warming fills it).


Shalin Shekhar Mangar added a comment - 30/Oct/08 04:07 PM
Committed revision 709188.

Thanks Yonik!


Yonik Seeley added a comment - 30/Oct/08 04:34 PM
The markAndSweep logic in the current code didn't replicate the logic I gave in my example ConcurrentLRUCache, and it can remove a lot more than it should.

Specifically, my algorithm broke the results into 3 groups:

  • those documents that are guaranteed to be in the top group (most recently accessed)
  • those documents guaranteed to be in the bottom group (immediately discard)
  • those documents where you can't tell.

The current code reversed this logic, assuming that one can remove everything that is not in the top group. This isn't valid though, as lastAccess isn't uniform (and thus the size of the top group could be as small as 1).


Todd Feak added a comment - 30/Oct/08 08:38 PM
I ran with a profiler and I'm not seeing any bursts of garbage collection. It's at a steady ~2%, with no major collections occurring (which is great!). However, the use of the profiler also slows things down about 10-20 % which seems to be enough that the pulsing goes away. I believe the pulsing may be some sort of limit of the testing I'm doing on a limited local environment. I'll include the patch in my current Solr app and do a more accurate comparison with our production level load tests to see if it still exists. Though it may be a while before I can provide feedback from that one, as getting machines allocated for the heavy load testing can take a bit.

As I said before, this is a huge improvement. Thanks for all the work on this one.


Yonik Seeley added a comment - 30/Oct/08 08:44 PM
I'll take a crack at updating the patch such that too many entries aren't removed.

Noble Paul added a comment - 31/Oct/08 01:47 AM

The current code reversed this logic,

I missed the point . I guess this post made it clear . As you mentioned it should be a 3 step cleanup


Shalin Shekhar Mangar added a comment - 31/Oct/08 05:35 AM
I forgot to add license headers to the three source files for this issue. I'll hold off adding them lest it breaks patches that you guys are working on.

Noble Paul added a comment - 31/Oct/08 06:13 AM
yonik's suggestions implemented

Noble Paul added a comment - 31/Oct/08 08:15 AM
creating an array[map.size()] for every markAndSweep() is expensive. Iterator should be better

Yonik Seeley added a comment - 31/Oct/08 03:45 PM
Thanks Noble, we had a mid-air implementation collision
I'm doing some quick performance testing the version I wrote now...I'll try it against your version after and then we can go from there.

Yonik Seeley added a comment - 31/Oct/08 06:04 PM
OK, here's the implementation based on my previous pseudo code, along with a very quick performance test.
The test uses random keys over a slightly bigger range than the table.
It also uses an upper water mark 10% higher than the lower water mark, and an acceptable water mark half way inbetween. I haven't experimented to see what the best acceptable water mark is for either impl. If anyone wants to do a more realistic test with real queries, they are welcome to it.

I did 4 runs and took the lowest number for each sub-test. Java6 -server, WinXP, P4. Times in milliseconds.

doPerfTest(2000000, 100000, 200000); // big cache
noble=17063  yonik=9157
    doPerfTest(2000000, 100000, 120000);  // smaller key space increases distance between 
oldest, newest and makes the first passes less effective.
noble=8391  yonik=5812
    doPerfTest(6000000, 1000, 2000);    // small cache, smaller hit rate
noble=17578  yonik=12515
    doPerfTest(6000000, 1000, 1200);    // small cache, bigger hit rate
noble=11500  yonik=8219

Yonik Seeley added a comment - 31/Oct/08 06:21 PM
Nobe, your latest patch contains code like this:
if (!markAndSweepLock.tryLock()) return;
    long oldestItem = this.oldestItem;
    [...]
    markAndSweepLock.unlock();
    this.oldestItem = oldestItem;

oldestItem isn't volatile (and doesn't need to be if accessed correctly).
The lock will also serve as a read barrier, so the first part is OK.
The unlock will serve as a write barrier, but the set of oldestItem comes after it (not OK... another thread may not see the value).
Changing the order (the write to oldestItem before the unlock) will ensure that the next thread that crosses a read barrier will see the new value.


Yonik Seeley added a comment - 01/Nov/08 07:20 PM
Added some minor changes, making sure that minLimit >= 1 and limit >minLimit (needed for rounding with small cache sizes).
Also added test code for LRUCache vs FastLRUCache.
It appears that LRUCache is faster (at least on my single proc PC) when the hit ratio is low, and FastLRUCache is faster when the hit ratio is high.
Should FastLRUCache be made the default in the example schema for the filterCache?
time=2937 impl=LRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=2266 impl=FastLRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=3594 impl=LRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816075
time=1484 impl=FastLRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=3203 impl=LRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835225
time=4593 impl=FastLRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.751506
time=3781 impl=LRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.834685
time=2656 impl=FastLRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8232835000000001
time=3234 impl=LRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.523398
time=5047 impl=FastLRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.3831675
time=4125 impl=LRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.511871
time=3969 impl=FastLRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.6665975
time=3390 impl=LRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.1445725
time=5687 impl=FastLRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10041049999999996
time=4750 impl=LRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10340150000000004
time=6875 impl=FastLRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.22233749999999997
time=1343 impl=LRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=860 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=1547 impl=LRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=703 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=1610 impl=LRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.833648
time=2406 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.7404839999999999
time=2078 impl=LRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8334255
time=859 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.998974
time=1922 impl=LRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5003285
time=2875 impl=FastLRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.3516785
time=2422 impl=LRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5002055000000001
time=1203 impl=FastLRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.821195
time=2297 impl=LRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10054949999999996
time=2969 impl=FastLRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.05416350000000003
time=3078 impl=LRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10003499999999999
time=3000 impl=FastLRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10475299999999999

Yonik Seeley added a comment - 01/Nov/08 09:07 PM
Latest patch:
  • fixes an off-by-one (it was possible to go below minSize)
  • fixes tests to not expect a specific number of evictions.
  • makes FastLRUCache the default for filterCache in the example schema and main test schema.

I'll commit soon if there are no objections.


Noble Paul added a comment - 02/Nov/08 02:27 AM
CacheEntry[] eset = new CacheEntry[sz];
int eSize = 0;

Isn't it too expensive to create a potentially huge array every time we do a clean? (too much work for GC) .May be we do not even need it if the first loop is enough. Moreover this one thing has added more code.

I didn't use lucene PriorityQueue because if somebody wished to lift the code they can easily do so if there is no other dependency. When I posted the requirement in google collections there was a lot of interest in such a component. Can TreeSet do the trick?


Yonik Seeley added a comment - 02/Nov/08 03:33 PM - edited

Isn't it too expensive to create a potentially huge array every time we do a clean? (too much work for GC)

That's what benchmarking is for

It's a single short-lived allocation that allows us to greatly reduce the number of elements we need to evaluate on successive passes. Inserts into a TreeSet may have a higher GC cost given it's an allocation per insert.

May be we do not even need it if the first loop is enough.

Right... although in my testing, it seemed like the first loop was rarely sufficient (although the second often was).


Shalin Shekhar Mangar added a comment - 03/Nov/08 07:24 AM
Here's the performance test from the patch on a more recent machine – Intel Quad Core, RHEL 64-bit, Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_11-b03, mixed mode):
time=1456 impl=LRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=1041 impl=FastLRUCache nThreads= 1 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=3256 impl=LRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=754 impl=FastLRUCache nThreads= 2 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=1234 impl=LRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835225
time=1564 impl=FastLRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.751506
time=3728 impl=LRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835006
time=1384 impl=FastLRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.798109
time=1357 impl=LRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.523398
time=1894 impl=FastLRUCache nThreads= 1 size=100000 maxKey=200000 gets=2000000 hitRatio=0.3831675
time=4556 impl=LRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.512785
time=1514 impl=FastLRUCache nThreads= 2 size=100000 maxKey=200000 gets=2000000 hitRatio=0.4682115
time=1614 impl=LRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.1445725
time=1837 impl=FastLRUCache nThreads= 1 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10041049999999996
time=4710 impl=LRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10963999999999996
time=1816 impl=FastLRUCache nThreads= 2 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.11144399999999999
time=339 impl=LRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=292 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=2511 impl=LRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=351 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=383 impl=LRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.833648
time=580 impl=FastLRUCache nThreads= 1 size=1000 maxKey=1200 gets=2000000 hitRatio=0.7404839999999999
time=2716 impl=LRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8337875
time=805 impl=FastLRUCache nThreads= 2 size=1000 maxKey=1200 gets=2000000 hitRatio=0.79799
time=570 impl=LRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5003285
time=794 impl=FastLRUCache nThreads= 1 size=1000 maxKey=2000 gets=2000000 hitRatio=0.3516785
time=3676 impl=LRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.49959549999999997
time=1685 impl=FastLRUCache nThreads= 2 size=1000 maxKey=2000 gets=2000000 hitRatio=0.436728
time=712 impl=LRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10054949999999996
time=1022 impl=FastLRUCache nThreads= 1 size=1000 maxKey=10000 gets=2000000 hitRatio=0.05416350000000003
time=4395 impl=LRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.100526
time=2562 impl=FastLRUCache nThreads= 2 size=1000 maxKey=10000 gets=2000000 hitRatio=0.08556600000000003

With more number of threads this time (4 and 16):

time=1794 impl=LRUCache nThreads= 4 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981608
time=594 impl=FastLRUCache nThreads= 4 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816075
time=1737 impl=LRUCache nThreads= 16 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981607
time=602 impl=FastLRUCache nThreads= 16 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981602
time=2387 impl=LRUCache nThreads= 4 size=100000 maxKey=120000 gets=2000000 hitRatio=0.830956
time=866 impl=FastLRUCache nThreads= 4 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8892465
time=1793 impl=LRUCache nThreads= 16 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8274485
time=706 impl=FastLRUCache nThreads= 16 size=100000 maxKey=120000 gets=2000000 hitRatio=0.9586865
time=2233 impl=LRUCache nThreads= 4 size=100000 maxKey=200000 gets=2000000 hitRatio=0.5025255
time=1228 impl=FastLRUCache nThreads= 4 size=100000 maxKey=200000 gets=2000000 hitRatio=0.654153
time=1905 impl=LRUCache nThreads= 16 size=100000 maxKey=200000 gets=2000000 hitRatio=0.500583
time=883 impl=FastLRUCache nThreads= 16 size=100000 maxKey=200000 gets=2000000 hitRatio=0.9067965
time=5336 impl=LRUCache nThreads= 4 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10182199999999997
time=1780 impl=FastLRUCache nThreads= 4 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.25870800000000005
time=2911 impl=LRUCache nThreads= 16 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10132300000000005
time=1941 impl=FastLRUCache nThreads= 16 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.508488
time=687 impl=LRUCache nThreads= 4 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=421 impl=FastLRUCache nThreads= 4 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=782 impl=LRUCache nThreads= 16 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=452 impl=FastLRUCache nThreads= 16 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=813 impl=LRUCache nThreads= 4 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8333735
time=678 impl=FastLRUCache nThreads= 4 size=1000 maxKey=1200 gets=2000000 hitRatio=0.9988885
time=794 impl=LRUCache nThreads= 16 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8331635
time=503 impl=FastLRUCache nThreads= 16 size=1000 maxKey=1200 gets=2000000 hitRatio=0.977526
time=1554 impl=LRUCache nThreads= 4 size=1000 maxKey=2000 gets=2000000 hitRatio=0.500093
time=928 impl=FastLRUCache nThreads= 4 size=1000 maxKey=2000 gets=2000000 hitRatio=0.802332
time=1102 impl=LRUCache nThreads= 16 size=1000 maxKey=2000 gets=2000000 hitRatio=0.5002759999999999
time=566 impl=FastLRUCache nThreads= 16 size=1000 maxKey=2000 gets=2000000 hitRatio=0.954131
time=1543 impl=LRUCache nThreads= 4 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10062899999999997
time=1039 impl=FastLRUCache nThreads= 4 size=1000 maxKey=10000 gets=2000000 hitRatio=0.7582409999999999
time=1372 impl=LRUCache nThreads= 16 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10031000000000001
time=604 impl=FastLRUCache nThreads= 16 size=1000 maxKey=10000 gets=2000000 hitRatio=0.935282

Now with 8 and 32 threads:

time=2109 impl=LRUCache nThreads= 8 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816075
time=608 impl=FastLRUCache nThreads= 8 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981606
time=1502 impl=LRUCache nThreads= 32 size=100000 maxKey=100000 gets=2000000 hitRatio=0.9816045
time=648 impl=FastLRUCache nThreads= 32 size=100000 maxKey=100000 gets=2000000 hitRatio=0.981592
time=3876 impl=LRUCache nThreads= 8 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8267995
time=748 impl=FastLRUCache nThreads= 8 size=100000 maxKey=120000 gets=2000000 hitRatio=0.915961
time=2176 impl=LRUCache nThreads= 32 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8271935
time=694 impl=FastLRUCache nThreads= 32 size=100000 maxKey=120000 gets=2000000 hitRatio=0.9652565
time=2038 impl=LRUCache nThreads= 8 size=100000 maxKey=200000 gets=2000000 hitRatio=0.5005305
time=1088 impl=FastLRUCache nThreads= 8 size=100000 maxKey=200000 gets=2000000 hitRatio=0.789179
time=2147 impl=LRUCache nThreads= 32 size=100000 maxKey=200000 gets=2000000 hitRatio=0.4997505
time=884 impl=FastLRUCache nThreads= 32 size=100000 maxKey=200000 gets=2000000 hitRatio=0.926915
time=2343 impl=LRUCache nThreads= 8 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10397699999999999
time=2207 impl=FastLRUCache nThreads= 8 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.34063
time=3440 impl=LRUCache nThreads= 32 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.10123850000000001
time=2087 impl=FastLRUCache nThreads= 32 size=100000 maxKey=1000000 gets=2000000 hitRatio=0.5367375
time=909 impl=LRUCache nThreads= 8 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=443 impl=FastLRUCache nThreads= 8 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=682 impl=LRUCache nThreads= 32 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=447 impl=FastLRUCache nThreads= 32 size=1000 maxKey=1000 gets=2000000 hitRatio=0.9998065
time=1189 impl=LRUCache nThreads= 8 size=1000 maxKey=1200 gets=2000000 hitRatio=0.832726
time=605 impl=FastLRUCache nThreads= 8 size=1000 maxKey=1200 gets=2000000 hitRatio=0.919104
time=1463 impl=LRUCache nThreads= 32 size=1000 maxKey=1200 gets=2000000 hitRatio=0.8337005
time=489 impl=FastLRUCache nThreads= 32 size=1000 maxKey=1200 gets=2000000 hitRatio=0.9845845
time=1256 impl=LRUCache nThreads= 8 size=1000 maxKey=2000 gets=2000000 hitRatio=0.500149
time=678 impl=FastLRUCache nThreads= 8 size=1000 maxKey=2000 gets=2000000 hitRatio=0.907774
time=1013 impl=LRUCache nThreads= 32 size=1000 maxKey=2000 gets=2000000 hitRatio=0.49962399999999996
time=503 impl=FastLRUCache nThreads= 32 size=1000 maxKey=2000 gets=2000000 hitRatio=0.976796
time=1504 impl=LRUCache nThreads= 8 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10030550000000005
time=754 impl=FastLRUCache nThreads= 8 size=1000 maxKey=10000 gets=2000000 hitRatio=0.9151345
time=1245 impl=LRUCache nThreads= 32 size=1000 maxKey=10000 gets=2000000 hitRatio=0.10028899999999996
time=499 impl=FastLRUCache nThreads= 32 size=1000 maxKey=10000 gets=2000000 hitRatio=0.978823

When the number of threads are increased, FastLRUCache is true to its name


Yonik Seeley added a comment - 03/Nov/08 02:31 PM
That benchmark really isn't valid for a high number of threads though: notice the difference in hitRatio.
If you have many threads quickly adding items and only one thread at a time removing items, the FastLRUCache goes over it's target size and thus increases it's hitRatio, making it artificially faster.

This isn't a concern for it's use in Solr though, since the generation of a cache value will be much slower than clearing the cache.


Todd Feak added a comment - 03/Nov/08 03:48 PM
Is there an easy way to get this patched into 1.3.0?

Right now, I think I have to grab 7 patches and apply them in order. Will that give me the correct content? Is there an easier way to do this from the repository?


Noble Paul added a comment - 04/Nov/08 04:04 AM
There is another number which we have ignored. If the cleanup is done in a separate thread , FastLRUCache consistently outperforms the legacy one. (shalin forgot to put the numbers). For a very large cache size , the cleanup takes ~200-300 ms. Which means a request can end up paying a huge price.

We need to add a new 'newCleanupThread' option to FastLRUCache (it is there in my old patch). I guess with that we can make FastLRUcache the default with newCleanupThread=true.

Is there an easy way to get this patched into 1.3.0?

If you apply yonik's latest patch on trunk you get two extra files . You can straightaway copy those two files to 1.3 and use it.


Shalin Shekhar Mangar added a comment - 04/Nov/08 10:55 AM
I ran it again with a new thread for each cleanup. Time taken for markAndSweep is printed:
time=1550 impl=LRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835225
MarkAndSweepTime = 138
MarkAndSweepTime = 35
MarkAndSweepTime = 36
MarkAndSweepTime = 39
MarkAndSweepTime = 41
MarkAndSweepTime = 42
MarkAndSweepTime = 42
MarkAndSweepTime = 44
MarkAndSweepTime = 43
MarkAndSweepTime = 43
MarkAndSweepTime = 44
MarkAndSweepTime = 43
MarkAndSweepTime = 43
MarkAndSweepTime = 44
MarkAndSweepTime = 43
MarkAndSweepTime = 44
MarkAndSweepTime = 44
MarkAndSweepTime = 43
time=1378 impl=FastLRUCache nThreads= 1 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8130459999999999

time=3942 impl=LRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.835045
MarkAndSweepTime = 58
MarkAndSweepTime = 165
MarkAndSweepTime = 32
MarkAndSweepTime = 34
MarkAndSweepTime = 37
MarkAndSweepTime = 37
MarkAndSweepTime = 46
MarkAndSweepTime = 40
MarkAndSweepTime = 61
MarkAndSweepTime = 53
MarkAndSweepTime = 51
MarkAndSweepTime = 44
MarkAndSweepTime = 47
MarkAndSweepTime = 47
MarkAndSweepTime = 48
MarkAndSweepTime = 48
MarkAndSweepTime = 46
time=1062 impl=FastLRUCache nThreads= 2 size=100000 maxKey=120000 gets=2000000 hitRatio=0.8560415

Yonik Seeley added a comment - 05/Nov/08 07:35 PM
I just committed a fix to the setting of acceptableSize - it was always being set to maxSize, which would normally cause the cleaning routine to return after the first phase (and thus be called more often than normal).

Noble Paul added a comment - 09/Nov/08 04:23 PM
added a new boolean attribute newCleanThread . Default is set to false

Shalin Shekhar Mangar added a comment - 14/Nov/08 11:21 AM
Yonik, what do you think about using a new cleanup thread?

Yonik Seeley added a comment - 14/Nov/08 04:42 PM
The ability to use a separate cleanup thread is interesting... but I'm not sure that having the ability to spin off a new thread for each cleanup is something one would ever want to do. The cleanup thread logic should probably be fixed too (no sleeping and polling... it should wait until notified that a cleanup is needed)

Noble Paul added a comment - 15/Nov/08 04:54 AM
OK that is a good idea. But this is an important functinality.

Noble Paul added a comment - 17/Nov/08 04:17 AM
  • cleanup thread does wait() and get notified when needed
  • ConcurrentLRUCache is generified
  • A new interface added to ConcurrentLRUCache called EvictionListener .This gets callback for each entry that is evicted
  • FastLRUcache has a new configuration 'cleanupThread' . default is set to 'false' . I believe it should be true by default

Noble Paul added a comment - 19/Nov/08 06:26 AM
made CacheEntry non static

Shalin Shekhar Mangar added a comment - 20/Nov/08 11:43 AM
Yonik, not trying to be pushy but can this patch be committed?

I want to create a build for internal use out of trunk with this feature.


Yonik Seeley added a comment - 22/Nov/08 10:26 PM
Does this have a thread leak? Where is FastLRUCache.destroy() ever called?
It's called from the finalizer (yuck), but that finalizer will never be called because the cleaning thread references the cache (the definition of liveness). Issues with having the cache deal with the thread lifecycle is why I previously recommended exploring the use of an Executor that the user passes in.

Noble Paul added a comment - 23/Nov/08 04:22 PM
Yonik, Nice catch . There was a thread leak.

I hope this patch fixes that. The cleanup thread now holds a WeakReference to the cache
The close() of Solrcache ensures that it is destroyed.


Yonik Seeley added a comment - 23/Nov/08 05:12 PM
Thanks Noble, looks like that solution should work.

Funny thing with the latest patch though - I get compile errors with "ant test" from the command line:

compile-common:
    [mkdir] Created dir: f:\code\solr\build\common
    [javac] Compiling 39 source files to f:\code\solr\build\common
    [javac] f:\code\solr\src\java\org\apache\solr\common\util\ConcurrentLRUCache.java:201: generic array creation
    [javac]       CacheEntry[] eset = new CacheEntry[sz];
    [javac]                           ^
    [javac] f:\code\solr\src\java\org\apache\solr\common\util\ConcurrentLRUCache.java:379: non-static class org.apache.solr.common.util.ConcurrentLRUCache.Cache
Entry cannot be referenced from a static context
    [javac]       return ((CacheEntry)b).lastAccessedCopy < ((CacheEntry)a).lastAccessedCopy;
   [...]

It looks like the compiler thinks that the method is static. IntelliJ doesn't flag any errors, and I can't see anything wrong after a quick glance at the code. Does "ant test" from the command line work for you?


Yonik Seeley added a comment - 23/Nov/08 09:44 PM
OK, committed latest patch after some minor logic changes (including changing CacheEntry from an inner class to a static inner class, which solved the compilation errors).

Noble Paul added a comment - 24/Nov/08 04:30 AM
  • Should we keep the cleanupThread= true default for FastLRUCache?

Yonik Seeley added a comment - 24/Nov/08 04:42 AM
I think the default should remain cleanupThread=false. It's simpler behavior, and for normal cache sizes, the pause an individual request may see is less than what one would see from a GC pause.