|
Fuad Efendi made changes - 28/Jul/08 03:25 AM
[
Permlink
| « Hide
]
Fuad Efendi added a comment - 28/Jul/08 03:29 AM
I renamed to FIFOCache just before opening an issue; in a local system it is (modified) LRUCache so that filterCache has reference to 'old' name...
Almost forgot: I am estimating performance gains basing on real application in-production, multithreaded, Tomcat 6.0.16 & JRockit R27 (Java 6) & SLES10 & two quad-core Opteron 2350 (8 CPUs total) & 25Gb for SOLR...
And, first queries run a long, more than a minute (warming up caches with faceted query id:[* TO *]) average Time-Per-Request decreases over time and it is now 1591.5232 giving 10x performance boost. Facets are highly distributed as you can see from website and filterCache... HTTP caching is supported - to see real timing you should execute real queries... ConcurrentHashMap is not applicable - we are not modifying cached item indeed... FIFO is without 'Out' if we have enough memory. This implementation isn't thread safe (due the the removed synchronization).
Why do we have almost the entire get() method synchronized on the map in LRUCache .
stats increment can possibly be put out of synchronized
It simplified the code and reduced the number of branches. What's your proposed version? public Object get(Object key) { synchronized (map) { Object val = map.get(key); if (state == State.LIVE) { // only increment lookups and hits if we are live. lookups++; stats.lookups.incrementAndGet(); if (val!=null) { hits++; stats.hits.incrementAndGet(); } } return val; } } I guess this is safe and faster
public Object get(Object key) { synchronized (map) { Object val = map.get(key); if (state == State.LIVE) { // only increment lookups and hits if we are live. lookups++; } } stats.lookups.incrementAndGet(); if (val != null) { //hits++; this must be removed stats.hits.incrementAndGet(); } return val; } let us remove the field hits and use stats.hits wherever we need it Your version changes what the method does in a couple of respects though.
stats.hits is for all caches of this type (it's shared across searchers). hits is local to this cache instance. Regarding Thread Safety:
From JavaDoc: "Note that this implementation is not synchronized. If multiple threads access a linked hash map concurrently, and at least We should synchronize only removeEntryForKey of HashMap; unfortunately we can't do it without rewriting HashMap. Probably we can use ConcurrentHashMap as a base class of LinkedHashMap, but I don't know answer yet... I am guessing that unsynchronized entry removal won't be significant issue in multithreaded environment.
A cache is not a read-only object. Gets need to be synchronized because other threads can be changing the cache.
Since we need to synchronize anyway, it's more efficient to just stick things like hits++ inside the sync block instead of having a separate AtomicLong.
I am familiar with Doug Lea's findings (he wrote his book in 1996, and it became part of java.util.concurrent after 10(!!!) years).
Adding a new key/value pair to the cache, or removal of a key/value pair from the cache.
This would only be under very specific usage patterns - everything pre-cached, and no adds or removes once the cache is "LIVE" (accessed by more than one thread concurrently). Fuad--
I agree with Yonik here. From the HashMap Javadoc:
I already commented it. Just try to avoid 'books' and 'authors', also try to find meaning in JavaDocs and try to browse JavaSource instead...
Will never happen. We need to synchronize inserts because each insert may calculate size and remove 'eldest' entry, and we need to avoid OOMs. We need to synchronize retrieves for LRU because 'Linked' HashMap (with Access Order) will change links (will reorder Map). And we do not need to synchronize retrieves from Insertion-Ordered LinkedHashMap (FIFO). It is classic... i'd like to research more java.util.concurrent Closing, since it's pretty much impossible to avoid all forms of synchronization and memory barriers for get() on a shared Map object.
Yonik Seeley made changes - 28/Jul/08 04:22 PM
Why? We have PoC at least, and we know the bottleneck!
We need improvement: avoid some synchronization; it is extremely easy with FIFO. We may try to improve LRU too. Not everything in JAVA is extremely good: for instance, synchronization. Even for single-threaded application, it needs additionally 600 CPU cycles (I've read it somewhere for SUN Java 1.3 on Windows) Yonik, please allow some time to think / to discuss. I'll try also to provide 'concurrent' LRU; but this issue is FIFO related. get() method do not need to be synchronized for FIFO cache. Unsynchronized object retrieval is not structural modification of Insertion-Ordered LinkedHashMap. Unsynchronized cache improves performance of multithreaded applications linear to number of CPUs.
Fuad Efendi made changes - 28/Jul/08 04:50 PM
OK.Let us try to study the failure cases .
get() does no state modification. So if the get is unsynchronized the worst case scenario is that we may get an Exception (I dunno what ConcurrentModificationException? ArrayIndexOutOfBoundsException? ). How often does it happen? What is the probability? What is the big deal ?(the system won't crash). We can catch the Exception and return null as if the entry was not found . That means that Solr may have to recompute results where it did not have to (this is a cost). here we totally eliminated the cost of synchronization (benefit). I guess if you do a cost benefit analysis this does not turn out to be as bad as it is projected to be. And after all the user is knowingly using this fully aware of the cost. is there anything else I have not considered
ConcurrentModificationException is thrown only when we iterate Map and another thread modified structure; with LRU each get() modifies structure, with FIFO - only inserts...
I think Fuad has a point. In an Insertion ordered LinkedHashMap, get makes no structural modification and if we synchronize put/remove, we should be fine. The cache warming is already thread-safe and we don't have iterators anywhere. Am I missing something here?
Sent to the mailing List:
>>Depends on if you want to go by the javadoc or not - it says that if any of the threads accessing the map concurrently make a structural change, the map must be synchronized. It's clear that get does not >>make a structural change when using insertion order, but can any other threads possibly make a mapping change while calling get? Sounds like yes, and so the contract would seem to indicate you >>synchronize... >>Put can modify shared variables that get accesses - sounds like dangerous ground to me. If it works, its got to be sneaky enough to warrant code smell... >>- Mark Further: Here is just one of possibly many problems - a put call can resize and make an entirely new table array. A get call can do something like the following with the table array: table[indexFor(hash, table.length) Do to execution reordering / memory barriers, would seem to me that the table being indexed into and the table.length may not be the values you were hoping for... I haven't looked at the proposed code at all, but it is possible to design this kind of datastructure, with much care:
Yes, I'm sorry for speaking out of turn here.
OK . I take my words back. LinkedHashMap is backed a by a HashMap . The table can get resized during the lifetime of a map (and hence during a get operation also). So the get() may return a wrong value (which is unacceptable) and it may never throw an Exception.
If we ever want to have an LRUCache with a non -synchronized get() we must have it backed by a ConcurrentHashMap Mark, thanks for going deeper. Yes, resize may change ( Entry[ ] ) table, key will disappear from bucket and _get() returns null:
/**
* The table, resized as necessary. Length MUST Always be a power of two.
*/
transient Entry[] table;
...
public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}
Might get null instead of MyValue (during resize by another thread); not a big deal! It is still Thread-Safe. Will add new entry (overwrite existing) in such extremely rare cases.
Fuad, before going further on this, we must also consider users who use alternate JVM implementations. Since the contract is to synchronize, we cannot base our decision on Sun JDK's implementation.
I don't have time to look at this right now, but I'll certainly spend some time tonight Efendi.
A quick to my point though: I mentioned something that might be done in an implementation that could be an issue - you came back with a specific implementation that may (or may not, I don't have time at the moment) not work anyway. We can't program to an implementation though - we must program to the contract. resize() is only the most obvious problem... there are tons of ways this can fail, even if you get around the resize (and your "null" fix for resize() won't work). Some of the failures can be in the form of incorrect results rather than null pointers or exceptions (so you can't just retry).
I'll reiterate:
Shalin, we are already using AtomicLong of Java 5; JVM is defferent story... JRockit R27 is JVM from BEA-Oracle, and its JDK 6 powered (rt.jar comes from SUN).
I just tried to compare synchronized with unsynchronized and found it is the main problem for faceting... Another problem: somehow faceting recalculates each time (using filterCache during repeated recalculations), queryCache is not enough... We need to invite Doug Lea to this discussion...
http://en.wikipedia.org/wiki/Doug_Lea http://gee.cs.oswego.edu/dl/index.html We may simply use java.util.concurrent.locks instead of heavy synchronized... we may also use Executor framework instead of single-thread faceting... We may even base SOLR on Apache MINA project.
Fuad, I am completely with you on this. Everybody will agree that a more efficient implementation will be a very useful addition to Solr. However, Yonik's concerns on the current patch are valid and we cannot go ahead with the current one. concerns are probably because of misunderstanding of some contract...
/**
* The table, resized as necessary. Length MUST Always be a power of two.
*/
transient Entry[] table;
void resize(int newCapacity) {
Entry[] oldTable = table;
int oldCapacity = oldTable.length;
if (oldCapacity == MAXIMUM_CAPACITY) {
threshold = Integer.MAX_VALUE;
return;
}
Entry[] newTable = new Entry[newCapacity];
transfer(newTable);
table = newTable;
threshold = (int)(newCapacity * loadFactor);
}
public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}
/**
* Transfer all entries from current table to newTable.
*/
void transfer(Entry[] newTable) {
Entry[] src = table;
int newCapacity = newTable.length;
for (int j = 0; j < src.length; j++) {
Entry<K,V> e = src[j];
if (e != null) {
src[j] = null;
do {
Entry<K,V> next = e.next;
int i = indexFor(e.hash, newCapacity);
e.next = newTable[i];
newTable[i] = e;
e = next;
} while (e != null);
}
}
}
P.S. * @author Doug Lea * @author Josh Bloch * @author Arthur van Hoff * @author Neal Gafter * @version 1.65, 03/03/05 P.P.S. description: xxx Cache(maxSize=10000000, initialSize=1000) size : 2668705 cumulative_inserts : 4246246
[quote]We may simply use java.util.concurrent.locks instead of heavy synchronized... we may also use Executor framework instead of single-thread faceting... We may even base SOLR on Apache MINA project.[/quote]
Simply replacing synchronized with java.util.concurrent.locks doesn't increase performance. There needs to be a specific strategy for employing these locks in a way that makes sense. For instance, one idea would be to create a read/write lock with the put()'s covered by write and get()'s covered by read. This would allow multiple parallel reads and will be thread-safe. Another is to create something like ConcurrentLinkedHashMap. These strategies should be tested before trying to create a lock-free get() version, which if even possible, would rely deeply on the implementation (such a structure would have to be created from scratch, I believe). I'd expect anyone that is able to create such a thing be familiar enough wiht memory barriers and such issues to be able to deeply explain the problems with double-checked locking off the top of their head (and immediately see such problems in other code)
I absolutely agree... ConcurrentHashMap is based on some level of acceptable safety, for specific tasks only
We can try to design specific caches directly implementing Map or ConcurrentMap interfaces. We should define 'safety' levels (for instance, null is not a problem if cache already contains this object added by another thread concurrently; cache elements are explicitly immutable objects; and etc.) This is extremely simple Concurrent LRU, I spent an hour to create it; it is based on ConcurrentHashMap. I don't use java.util.concurrent.locks, and I am trying to focus on requirements only avoiding implementing unnecessary methods of Map interface (so that I am not following contract
import java.util.Map; import java.util.concurrent.ConcurrentHashMap; public class ConcurrentLRU<K, V> { protected ConcurrentHashMap<K, ValueWrapper<V>> map; protected int maxEntries; public ConcurrentLRU(int maxEntries) { map = new ConcurrentHashMap<K, ValueWrapper<V>>(); this.maxEntries = maxEntries; } public V put(K key, V value) { ValueWrapper<V> wrapper = map.put(key, new ValueWrapper<V>(value)); checkRemove(); return value; } void checkRemove() { if (map.size() <= maxEntries) return; Map.Entry<K, ValueWrapper<V>> eldestEntry = null; for (Map.Entry<K, ValueWrapper<V>> entry : map.entrySet()) { long popularity = entry.getValue().popularity; if (eldestEntry == null || eldestEntry.getValue().popularity > popularity) { eldestEntry = entry; } } map.remove(eldestEntry.getKey(), eldestEntry.getValue()); } public V get(Object key) { ValueWrapper<V> wrapper = map.get(key); return wrapper == null ? null : wrapper.getValue(); } public final static class ValueWrapper<V> { static volatile long popularityCounter; volatile long popularity; V value; ValueWrapper(V value) { this.value = value; popularity = popularityCounter++; } public boolean equals(Object o) { if (!(o instanceof ValueWrapper)) { return false; } return (value == null ? ((ValueWrapper) o).value == null : value.equals(((ValueWrapper) o).value)); } public int hashCode() { return value.hashCode(); } public V getValue() { popularity = popularityCounter++; return value; } } } P.S. Fuad, you are on the right track now by using something that is thread-safe (ConcurrentHashMap). A couple of minor points:
Anyway, If this works for you, use it!
That's probably not true for modern JVMs though – cf. http://www.ibm.com/developerworks/library/j-jtp04223.html
How did you measure that exactly? The Solr admin pages will not give you exact measurements. Could you describe the test setup in detail? I'm guessing that you're caching the results of all queries in memory such that no disk access is necessary. Are you using highlighting or anything else that might be CPU-intensive at all? From my personal experience with Solr I wouldn't expect synchronization for the caches to be that big of a performance penalty. In some of my tests with a several GB index where all results where cached and highlighting was turned on I've seen throughputs in excess of 400 searches per second. I think that the performance bottleneck in this case was the network interface for sending the replies.
Consider the following case: thread A performs a synchronized put, thread B performs an unsynchronized get on the same key. B gets scheduled before A completes, the returned value will be undefined. Yes, we could do sanity checks to minimise these cases, but that would probably end up being more expensive than the synchronization.
one of the threads modifies the map structurally, it must be synchronized externally." That's exactly the case here – the update thread modifies the map structurally! It doesn't do this at all times, probably even never after the cache has been populated, but there's no way to know for sure unless you explicitely remove the put method. I'm not convinced that we should change the current implementation for the following reasons:
That said, if you can show conclusively (e.g. with a profiler) that the synchronized access is indeed the bottleneck and incurs a heavy penalty on performance, then I'm all for investigating this further.
Yes, and I do not need exact measurements! It gives me averageTimePerRequest which improved almost 10 times on production server. Should I right JUnit tests and execute it in a single-threaded environment? Better is to use The Grinder, but I don't have time and spare CPUs.
But 'searches per second' is not the same as 'average response time'!!!
Yes, I am using highlighting. You can see it at http://www.tokenizer.org
But this is another bug of SOLR!!! I am using extremely large caches but SOLR still recalculates facet intersections. - SOLR-669
You probably mean 'LinkedList traversal' but this is not the case. This is why we need to browse JavaSource... LinkedHashMap extends HashMap and there is no any 'traversal', public V get(Object key) { Entry<K,V> e = (Entry<K,V>)getEntry(key); if (e == null) return null; e.recordAccess(this); return e.value; }
the returned value is well defined: it is either null or correct value.
Who cares? We are not iterating the map!
What?!! Anyway, I believe simplified ConcurrentLRU backed by ConcurrentHashMap is easier to understand and troubleshoot...
No, it is not a bug. it is virtually "checkpoint", like as a timer, one timer for all instances. We can use System.currentTimeMillis() instead, but static volatile long is faster. About specific use case: yes... if someone has 0.5 seconds response time for faceted queries I am very happy... I had 15 seconds before going with FIFO.
but get() method of LinkedHashMap reorders whole map!!! (Of course, CPU load is evenly distributed between several get() so that we can't see it) Other implementations even use Arrays.sort() or something similar. I don't see easier solution than that... probably some random-access policy with predictable range of "popularity", we can evict anything 'old' and not necessarily 'eldest'...
No it doesn't... think linked-list. It moves a single item, which is pretty fast. Lars, I used FIFO because it is extremely simple to get unsynchronized get():
map = new LinkedHashMap(initialSize, 0.75f, true) - LRU Cache (and we need synchronized get()) map = new LinkedHashMap(initialSize, 0.75f, false) - FIFO (and we do not need synchronized get()) Yonik, I'll try to improve ConcurrentLRU and to share findings... of course FIFO is not what we need.
yes, so I wrote 'evenly distributed between several get() so we can't see it' - it keeps List ordered and we can't unsynchronize it with all subsequences!!!
It is if nothing is modifying the map during the get. If something is modifying the map you don't know how the implementation handles the insert of a new value. It might copy the object, and you'd end up with half an object or even an invalid memory location. That's why the javadoc says that you must synchronize accesses if anything modifies the map – this is not limited to iterators.
No, it is not! Your analysis seems to ignore the java memory model (partially constructed objects and all that). I don't know how many different ways to say it.... please do yourself a favor and read up on the java memory model (and the book I previously referenced is great for this). This is hard stuff (at the lowest levels).
JavaDoc does not say that. Instead, it says (I am repeating):
I know exactly: SOLR does not modify 'value' during 'insert', Map.Entry instances are immutable in SOLR, etc. Table resize is main problem - but after analyzing source code I don't see any problem. Consern that 'wrong value will be returned for a key' is not applicable. And JavaDocs does not say anything about that. Collections internally use Map.Entry in an immutable way, do not change it.
Ok. May be I can get reference to wrong object type, or even object scheduled for finalization... But we are not inserting into Map 'partially constructed objects', isn't it? Simplest scenario: Thread A tries to get variable (4 bytes of address in JVM) pointing to object O. Another thread B concurrently assigns null to that variable. Isn't it solved at CPU level yet? Or, may be on 64bit system thread B assigns zero to first 2 bytes, and then to another 2 bytes? I need to study this book... BTW, I am running JVM with '-server' option. The way I understand the javadoc is
The it references the map. It says explicitely that the map must be synchronized, i.e. all operations on it. It does not say that only the thread modifying it must be synchronized. Consider the case where only one thread modifies the map. You're saying that in this case synchronization would not be necessary, as only modifications themselves need to be synchronized and there is only one thread doing that. The javadoc explicitely says that it is necessary.
BTW, Joshua Bloch developed Arrays.sort(), and bug was found after 9 years. Nothing is perfect. ConcurrentLRU looks extremely simple and easy to improve. Should we check SUN's bug database before using ConcurrentHashMap? It has some related... Just a quick comment from the Peanut Gallery.
Instead of arguing about what the semantics of what the JavaDocs for java.util.Map and friends actually say, and worrying about whether particular implementations actually follow the rules but provide reasonable performance, this seems like a good opportunity to design and build an application specific data structure that has the behavioral characteristics you want (Fuad is after fastest-possible reads, everybody is after reasonable behavior in the face of concurrent writes). My only recommendation in this regard would be this: don't make a custom implementation say "implements java.util.LinkedHashMap" (again, or whatever implementation is currently in use) unless it does actually implement the documented semantics for java.util.LinkedHashMap. It's fine to have a custom FifoCacheMap (or whatever name you like) class that does not implement java.util.Map. It's not fine to say you implement an interface but then break the documented contract.
Right. Any solr cache must currently implement the SolrCache interface, and LinkedHashMap is simply an implementation detail of the LRUCache implementation of SolrCache (which has no relationship to the Map interface).
Thanks and sorry for runtime errors; FIFO looks strange at first, but... for large cache (100000 items), most popular item can be mistakenly removed... but I don't think there are any 'most popular facets' etc.; it's evenly distributed in most cases. Another issue: SOLR always tries recalculate facets even with extremely large filterCache & queryResultCache, even the same faceted query shows always the same long response times.
I agree of course... However, we are not dealing with unknown implementation of java.util.Map clonig (java.lang.Cloneable) objects somehow or using some weird object introspection etc.... BTW there is almost no any functional difference between LRU and FIFO. And there is huge difference between LRU (Least Recently Used) and LFU (Least Frequently Used).
It's easy to implement ConcurrentLFU based on provided ConcurrentLRU template; of course, following the main contract org.apache.solr.search.SolrCache. I Fuad. This is a worthwhile effort to undertake. Why don't you implement a LRUCache implementation and post it (as a java file) . Let everyone review and see if it can be improved
Paul, I want to do it... understand me, since February I am having constant performance problems with faceted queries (15-20 seconds response time), I ordered new server with 32 Gb & 2x quad-core (8x times more power!) but it didn't improve performance; finally I commented sync in LRUCache and made it FIFO... I was very impatient with this post, just tried to share very real staff...
A FIFOCache using ConcurrentHashMap.
Please give feedback
Noble Paul made changes - 29/Jul/08 06:26 AM
Another one this is LRU. uses ConcurrentHashMap again
Noble Paul made changes - 29/Jul/08 09:34 AM
Noble Paul made changes - 29/Jul/08 02:42 PM
Thanks Paul, I am always trying to simplify... AtomicReference is a pointer to (approximately) least recently used 'key'...
Fuad Efendi made changes - 30/Jul/08 04:32 AM
Fuad Efendi made changes - 30/Jul/08 04:40 AM
another bug... and AtomicReference is generic... never used it before. Could be even 'weaker' if we use 'hashcode' which is long (and atomic) instead of Key which is object (and unsafe), and distribution of hashcode is ok...
Fuad Efendi made changes - 30/Jul/08 04:50 AM
There are a few obvious issues w/ your patch.
Effendi your implementation is wrong if (wrapper.lastAccessed < pointer.lastAccessed) { ref.set(new Pointer(key, wrapper.lastAccessed)); } will always evaluate to false. And the reference will always have one value In that implementation there is always a contention on the head so gets are bound to be slow. That is why i did not go that route There is no need to use a WeakReference anywhere. This is a cache. So it must maintain strong reference I have opened
Noble, thanks for feedback!
Of course my code is buggy but I only wanted to illustrate simplest idea; I am extremely busy with other staff (Liferay) and can't focus on SOLR improvements... may be during weekend.
I mean (and code too) the same; probably wrong wording
Do we really-really need thread safety here? By using 'volatile' I only prevent some JVMs from trying to optimize some code (and cause problems).
Agree...
May be... but looks similar to Arrays.sort(), or TreeSet, and etc.... I am trying to avoid this. 'No other shortcut' - may be, but I am unsure. Thanks!
Fuad Efendi made changes - 30/Jul/08 02:36 PM
Fuad Efendi made changes - 30/Jul/08 02:44 PM
Fuad Efendi made changes - 30/Jul/08 02:45 PM
Fuad Efendi made changes - 30/Jul/08 02:52 PM
Fuad Efendi made changes - 30/Jul/08 02:52 PM
Fuad Efendi made changes - 30/Jul/08 03:05 PM
Fuad Efendi made changes - 30/Jul/08 03:05 PM
I don't think ConcurrentHashMap will improve performance, and ConcurrentMap is not what SOLR needs:
V putIfAbsent(K key, V value);
V replace(K key, V value);
boolean replace(K key, V oldValue, V newValue);
There is also some(...) overhead with oldValue and the state of the hash table at some point; additional memory requirements; etc... can we design something plain-simpler being focused on SOLR specific requirements? Without all functionality of Map etc... Guys at LingPipe (Natural Language Processing) http://alias-i.com/
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html http://alias-i.com/lingpipe/docs/api/com/aliasi/util/HardFastCache.html closing as duplicate since we have a concurrent cache impl now.
Yonik Seeley made changes - 21/Feb/09 03:31 PM
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||