Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      By using our own interned-string pool on top of the default one, String.intern() can be greatly optimized.

      On my setup (Java 6) this alternative runs ~15.8x faster for already-interned strings, and ~2.2x faster for 'new String(interned)'.
      For Java 5 and 4 the speedup is lower, but still considerable.
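      A pool of this shape can be sketched as follows. The class and method names here are hypothetical, and this synchronized version only illustrates the semantics; the attached patches avoid locking on the lookup path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an intern pool layered over String.intern().
// A map hit returns the canonical instance without the native intern call;
// a miss falls back to String.intern() and caches the result, so callers
// still receive exactly the instance String.intern() would return.
class SimpleStringPool {
    private static final Map<String, String> pool = new HashMap<String, String>();

    static synchronized String intern(String s) {
        String cached = pool.get(s);
        if (cached == null) {
            cached = s.intern();   // stays String.intern()-compatible
            pool.put(cached, cached);
        }
        return cached;
    }
}
```

      Because every returned instance comes from String.intern(), code that compares field names with == keeps working.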

      1. intern.patch
        30 kB
        Earwin Burrfoot
      2. LUCENE-1607.patch
        2 kB
        Noble Paul
      3. LUCENE-1607.patch
        10 kB
        Yonik Seeley
      4. LUCENE-1607.patch
        6 kB
        Yonik Seeley
      5. LUCENE-1607.patch
        5 kB
        Yonik Seeley
      6. LUCENE-1607.patch
        5 kB
        Yonik Seeley
      7. LUCENE-1607.patch
        5 kB
        Yonik Seeley
      8. LUCENE-1607.patch
        36 kB
        Earwin Burrfoot
      9. LUCENE-1607.patch
        31 kB
        Earwin Burrfoot
      10. LUCENE-1607.patch
        6 kB
        Yonik Seeley
      11. LUCENE-1607.patch
        6 kB
        Yonik Seeley
      12. LUCENE-1607.patch
        5 kB
        Yonik Seeley
      13. LUCENE-1607-contrib.patch
        3 kB
        Uwe Schindler

        Issue Links

          Activity

          Mark Miller added a comment -

          What was the field count? Is it still a considerable speedup with hundreds of fields, without slowing anything else down? (I would assume so, but it would be nice to know, considering a new HashMap is made per add - it is a one-time hit though, and the number of fields is not likely to exceed hundreds at the extreme.)

          Also would be great to get the speedup numbers for Java 4 and 5.

          I'll relate this to the 3 or 4 other field intern issues out there in a bit.

          Yonik Seeley added a comment -

          Here's a completely lockless and memory barrier free intern() cache.
          This default would be more back compatible since programs may rely on String instances being interned via String.intern().

          It does not yet include corresponding Lucene code changes to use the StringInterner.

          Thoughts?
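          A cache of this shape might look like the following sketch (illustrative names, not the attached patch): reads take no lock, and a collision simply overwrites the slot, so the worst case is an extra String.intern() call.

```java
// Illustrative sketch of a lockless intern cache with no re-probing:
// a fixed-size, power-of-two table where a collision overwrites the slot.
// Reads never lock; a miss or an overwritten entry just costs one
// String.intern() call, so results stay String.intern()-compatible.
class SimpleInternCache {
    private final String[] cache = new String[4096];  // power-of-two size

    String intern(String s) {
        int slot = s.hashCode() & (cache.length - 1);
        String cur = cache[slot];
        if (cur != null && (cur == s || cur.equals(s))) return cur;
        String interned = s.intern();  // fallback keeps canonical instances
        cache[slot] = interned;        // racy overwrite: worst case, a future miss
        return interned;
    }
}
```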

          Yonik Seeley added a comment -

          Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe.

          Mark Miller added a comment -

          Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe.

          That was my first impression too, but I couldn't pin down the issue. The access will either be against the old pool or against the new pool, and the instance switch should be atomic? I figured it was a clever trick of some kind (though I did wonder about the cost of making the new HashMap on every add). The HashMaps are read-only, right (once they can be accessed by multiple threads)? And they are popped in with an atomic variable assignment?

          Earwin Burrfoot added a comment -

          This default would be more back compatible since programs may rely on String instances being interned via String.intern().

          My version is also String.intern()-compatible

          Earwin, I took a quick look at your implementation just now, but it doesn't look thread-safe.

          An unlock by one thread happens-before a lock on the same monitor by another thread, and within each thread each action happens-before the next one.
          Since I get the pool reference a second time after taking the lock, and write the pool reference before unlocking, everything's fine. As for the other threads: if they find what they need in the pool, it doesn't matter whether they're seeing a stale pool or not. If they don't find what they need, they hit the lock, re-retrieve the pool reference, get the latest one, and either find what they need there or write.
          Correct me if I'm wrong?

          though I did wonder about the cost of making the new hashmap every add

          It's COSTLY. But you're going to pay all of it at startup.

          I think we can introduce the class (without any interfaces - why would we need one here?) and then try to make it faster by switching storage. I tried GNU Trove's THashMap, but on Java 6 it was slower than the stock HashMap.

          Earwin Burrfoot added a comment -

          What was the field count? Is it still a considerable speedup with hundreds of fields without slowing anything else down ?

          The field count was 1. I rewrote the benchmark; with the extra code, the Java 6 speedup on a single already-interned field became 12.5x.
          Results for three Java varieties and different sets of keys follow:

          Java 6 (64, server):

          1 key
          interned 12.47x
          uninterned 3.76x

          10 keys
          interned 8.03x
          uninterned 3.08x

          100 keys
          interned 6.58x
          uninterned 2.55x

          1000 keys
          interned 5.39x
          uninterned 2.69x

          Java 5 (64, server):

          1 key
          interned 9.84x
          uninterned 5.03x

          10 keys
          interned 7.00x
          uninterned 4.61x

          100 keys
          interned 6.61x
          uninterned 2.28x

          1000 keys
          interned 4.73x
          uninterned 2.73x

          Java 4 (32, client):

          1 key
          interned 4.90x
          uninterned 2.88x

          10 keys
          interned 4.08x
          uninterned 2.67x

          100 keys
          interned 3.88x
          uninterned 2.52x

          1000 keys
          interned 3.44x
          uninterned 2.31x

          Yonik Seeley added a comment -

          The thread safety problem has to do with safe object publication (making an object visible to a different thread). Mutable objects generally can't be safely published w/o synchronization.... it has to do with CPU caches in multi-CPU systems.

          IIRC, current x86 architectures will see fewer problems since read barriers are no-ops.. caches are guaranteed to be coherent. But even on x86, you aren't guaranteed that instructions and writes won't be reordered... so the assignment to pool could be visible before all the memory changes that newPool.put() could cause.

          Derek DeMarco added a comment -

          Good point Yonik. Making pool volatile should take care of it, but only on JVMs 1.5+, as before that volatile didn't prevent reordering of non-volatile reads/writes around it.
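          The fix being described can be sketched like this (hypothetical names): readers take no lock, while writers copy the map under a lock and publish the copy through a volatile field, so a reader that sees the new reference also sees the fully built map.

```java
import java.util.HashMap;

// Sketch of a copy-on-write pool published through a volatile field.
// Under the Java 5 memory model, the volatile write of 'pool' happens-after
// the new HashMap is fully populated, and a volatile read that observes the
// new reference therefore observes a complete map.
class VolatileSwapPool {
    private volatile HashMap<String, String> pool = new HashMap<String, String>();

    String intern(String s) {
        String cached = pool.get(s);       // lock-free volatile read
        if (cached != null) return cached;
        synchronized (this) {
            cached = pool.get(s);          // re-check the latest pool under the lock
            if (cached == null) {
                cached = s.intern();
                HashMap<String, String> newPool = new HashMap<String, String>(pool);
                newPool.put(cached, cached);
                pool = newPool;            // volatile write publishes the copy safely
            }
        }
        return cached;
    }
}
```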

          Earwin Burrfoot added a comment -

          Okay, you're probably right. It's not that the hashmap can be corrupted on a subsequent write; it's that reader threads can possibly see a not-yet-built hashmap.
          And volatile works, while simple sync doesn't, because volatile also orders reads.
          Hm. Hm. Hm.
          Then, as you suggested, all we need is our own hash implementation that remains usable even while partially updated. I skimmed through your patch and something looks suspicious to me, like the lack of collision resolution and this line:
          int slot = h & (cache.length-1);
          It'll break on non-power-of-two sizes.
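          For a power-of-two table length, masking with length-1 is equivalent to taking the remainder for a non-negative hash; for any other length the mask skips slots and diverges from the remainder. A quick illustration:

```java
// h & (len - 1) equals h % len only when len is a power of two
// (for non-negative h); otherwise some slots become unreachable.
class MaskDemo {
    static int slot(int h, int len) {
        return h & (len - 1);
    }
}
```

          For example, with h = 1037, slot(h, 16) and 1037 % 16 are both 13, but slot(h, 12) is 9 while 1037 % 12 is 5.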

          Mark Miller added a comment -

          The thread safety problem has to do with safe object publication

          Oh great, you mean basic concurrency competency. Well, that's an embarrassing erosion of knowledge. Back to the books. Thread A has no happens-before relationship with thread B unless they share the lock. I trained myself to just synchronize or use volatile long ago (and then let the knowledge start seeping away, I guess), but even still, every time I see one of these double-checked-locking tricks I think: eh, this one must work or something.

          Yonik Seeley added a comment - - edited

          lack of collision resolve

          My version was the most basic... a simple cache that is not guaranteed to always hit. I also wanted to keep the overhead very low in case of misses (hence no re-probing). In the best case, I don't think one can get much faster... and in the worst case it won't be much slower than simple String.intern(). There could be other implementations that do resizing of course.

          It'll break on non-power-of-two sizes.

          The size is guaranteed to be a power of two by the constructor.

          Yonik Seeley added a comment -

          Here's another version that uses closed hashing and table resizes to cache all interned strings requested. It remains lock free and memory barrier free on the read side.
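          The approach can be sketched roughly like this (names are illustrative, not the patch's code): an open-addressing table of interned Strings where reads probe without locks or barriers, and writes synchronize. Because String is safely published via its final fields, a racing reader either sees a complete entry or a null and falls through to the slow path; resizing is omitted here for brevity.

```java
// Illustrative open-addressing ("closed hashing") intern table.
// Reads probe linearly with no locks; a miss takes the synchronized slow
// path, which calls String.intern() so results stay canonical. A real
// implementation would also track the load factor and resize the table.
class LocklessInternTable {
    private volatile String[] cache = new String[16];  // power-of-two size

    String intern(String s) {
        String[] arr = cache;
        int slot = s.hashCode() & (arr.length - 1);
        String cur;
        while ((cur = arr[slot]) != null) {
            if (cur == s || cur.equals(s)) return cur;   // hit
            slot = (slot + 1) & (arr.length - 1);        // linear re-probe
        }
        return internSlow(s);                            // miss
    }

    private synchronized String internSlow(String s) {
        String interned = s.intern();                    // canonical instance
        String[] arr = cache;
        int slot = interned.hashCode() & (arr.length - 1);
        while (arr[slot] != null) {
            if (arr[slot].equals(interned)) return arr[slot];
            slot = (slot + 1) & (arr.length - 1);
        }
        arr[slot] = interned;
        return interned;
    }
}
```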

          Yonik Seeley added a comment -

          corrected load factor check (the previous code actually calculated 80%, not 75%).

          Earwin Burrfoot added a comment -

          Okay, I thought more about that. Yonik is amazing.

          The fastest hash we can get should have no collisions. This is achievable by resizing on each new collision. Then we should introduce an upper bound for this process, so it doesn't blow up. Finally, we can use that upper bound as the hash size from the start.

          I benchmarked a bit, it works better than HashTable.
          Somewhat better for already interned strings, much better for noninterned strings.
          "s1==s2 || s1.compareTo(s2) == 0" combo amazingly works faster than s1.equals(s2).
          Additional hashcode check makes sparse hash access a bit slower and doesn't really help with crowded hash.
          Having a crowded hash degrades performance a lot.

          I updated my patch with Yonik's algorithm. I kept everything in statics (faster) and allowed changing the cache size through a system property for adventurous types; the default is 16k (works well for 100 values).

          Earwin Burrfoot added a comment -

          Hehe, ten minute difference. Take over this issue, since you're obviously better at it than I?

          Yonik Seeley added a comment - - edited

          The fastest hash we can get, should have no collisions. This is achievable by resizing on each new collision.

          edit: agree, for the first version that was only a cache where collisions invalidate the entry and cause another String.intern() to be called... my comments below are with respect to the second version of my code where interned strings are never dropped from the table.

          Hmmm, in my quick'n'dirty tests of about 256 unique strings, a smaller hash table was actually quicker (initialized with 32 and let it resize vs starting at 1024). I imagine that this would be due to a larger part of the table fitting in smaller and faster processor caches. YMMV. Collisions should also be very quick to skip by comparing the hash code (which is cached for Strings).

          Earwin Burrfoot added a comment -

          Collisions should also be very quick to skip by comparing the hash code (which is cached for Strings).

          In no-collision resolution scheme, if you detect a collision early with hashcode, you still call String.intern(). That kills any benefit gained from hashcode check vs compareTo. And if no collision is detected, introducing this check slows things down a little. Yes, that surprised me too.

          Yonik Seeley added a comment -

          In no-collision resolution scheme, if you detect a collision early with hashcode, you still call String.intern().

          In the old code, yes... that's why I left it commented out there. In the latest code, we re-probe on a collision w/o calling compareTo (assuming hashCodes are unequal) and only call intern() if it's really a String we haven't seen before. The new code is a real hash table that will store all the Strings requested through it, unlike the first variant that was a simple cache.

          Earwin Burrfoot added a comment -

          I tried it out. It works a little better than the simple cache (no stray interns must've paid off) and doesn't degrade at all.
          I'd like to change the starter value to something like 256-1024; it works way better for 10-20 fields.

          Why h >> 7? I understand that you're sacking collision-guilty bits, but why not the exact amount that was used (you'd have to store it?), or a whole byte or two?

          Earwin Burrfoot added a comment -

          Mmm.. what's the status of this one?
          Should I add a patch with Yonik's last hash impl and all calls to String.intern() replaced to get it moving?

          Patrick Eger added a comment -

          As a quick comment on the implementation, I notice that it is possible (with reasonable probability) for hash collisions to result in re-interning a pair of strings multiple times. For example, a codepath that traverses 32 unique string datapoints (say, in an inner loop somewhere) would have a minimum 3% probability of colliding and re-interning 2 strings every time through the loop. Because of the birthday paradox, it becomes likely to hit such a situation (50% probability with ~150 unique values).

          These are the probabilities of collision, assuming random distribution and perfect hashing. In real life the distribution will not be so random (string.hashCode() & MASK) so these will be "best case".
          2 datapoints: collision prob = 0.006104%
          4 datapoints: collision prob = 0.036617%
          8 datapoints: collision prob = 0.170779%
          16 datapoints: collision prob = 0.729976%
          32 datapoints: collision prob = 2.983863%
          64 datapoints: collision prob = 11.591861%
          128 datapoints: collision prob = 39.188158%
          256 datapoints: collision prob = 86.501947%

          Practically this may or may not matter. My thought is that some sort of fast LRU structure would be better, but of course creating something like this without locking may be tricky. Another idea might be to support some form of limited hash-chaining or probing in the table, which would mitigate the damage of a collision significantly.

          for reference, python code for calculating birthday/hash collisions, shamelessly stolen:
          --------------------
          def bp(n, d):
              v = 1.0
              for i in range(n):
                  v = v * (1 - float(i) / d)
              return 1 - v

          for n in [2, 4, 8, 16, 32, 64, 128, 256]:
              print "%i datapoints: collision prob = %f%%" % (n, bp(n, 16*1024)*100)

          Yonik Seeley added a comment -

          why h >> 7?

          Was copied from Solr's hashing of doc ids... we didn't want to throw away too many lower bits since they were likely to be the most random. In string hashes, the rightmost bits also have the most entropy.

          Should I add a patch with Yonik's last hash impl and all calls to String.intern() replaced to get it moving?

          That would be helpful, thanks!

          Yonik Seeley added a comment -

          As a quick comment on the implementation, i notice that it is possible (with reasonable probability) for hash collisions to result in re-interning a pair of strings multiple times.

          For the first implementation, yes. The latest implementation is guaranteed to only intern a string once (the hashtable does probing and resizing).

          Patrick Eger added a comment -

          Ah I see thanks. Was looking at the most recent patch (by date).

          Earwin Burrfoot added a comment -

          This should do.
          I replaced a pair of intern()s that appeared on trunk after the last patch; the initial cache size is 1024 and probably won't go up for most people.
          I also renamed some variables while trying to understand the code; I can revert them if that's important to anyone.

          Yonik Seeley added a comment - - edited

          The last patch removed the ability to plug in a different implementation, but I guess we don't need that until we have another implementation (and it's not clear whether the benefits of dumping String.intern() compatibility in the future outweigh the disadvantages of breaking back compatibility).

          Rethinking what possible problems this could have... I'm not sure it should be committed in its current form. One problem is potentially unlimited growth where there was none before. This could happen in a long-running search system where users enter a variety of field names that don't even exist.

          A WeakReference-based approach would work (as String.intern() uses), but that's more heavyweight and could need synchronization, since we'd be dealing with more complex objects.
          Another option is to go back to a simple cache-based approach, which could still pin otherwise unreferenced Strings in memory but would have a bounded size. Perhaps this argues for reinstating the pluggability of the intern implementation.

          Earwin Burrfoot added a comment -

          Is there 'any' benefit to dumping String.intern() compatibility? All those interns happen at startup anyway.
          If the future brings us something better (which I doubt in this case), we'll just swap the impl, which is all-private.

          All in all, I consider the practice of system-property-driven pluggable implementations a sickening one. Take MMapDirectory, which was impossible to use until now without twiddling with startup keys, or SegmentReader/ReadonlySegmentReader - the resulting code is ugly, nobody's going to override the defaults anyway, and all the important dependencies are package-private.

          Earwin Burrfoot added a comment -

          A top bound on cache size will do? If you're fed too many unique strings, it'll end up with the characteristics of a simple cache.

          Yonik Seeley added a comment -

          Here is another alternative that is limited in size to prevent unbounded growth, and should still handle collisions acceptably. It's completely lockless (even for updates) and uses open hashing. Each bucket is an insertion-order cache (inserts at the head) and is limited to a certain length.
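          The approach described above could be sketched roughly like this (a hypothetical illustration of the idea, with made-up details; not the committed patch):

```java
// A minimal sketch of the described design: a fixed-size (power-of-two)
// open-hash table of singly-linked buckets, inserts at the head, each
// bucket capped at a maximum length. Reads and writes are lockless; a
// lost update at worst means a string gets re-inserted later.
class SimpleStringInterner {
  private static final class Entry {
    final String str;
    final int hash;
    Entry next;                      // mutable so the bucket tail can be evicted
    Entry(String str, int hash, Entry next) {
      this.str = str;
      this.hash = hash;
      this.next = next;
    }
  }

  private final Entry[] cache;       // never resized after construction
  private final int maxChainLength;

  public SimpleStringInterner(int tableSize, int maxChainLength) {
    cache = new Entry[tableSize];    // tableSize must be a power of two
    this.maxChainLength = maxChainLength;
  }

  public String intern(String s) {
    int h = s.hashCode();
    int slot = h & (cache.length - 1);

    Entry first = cache[slot];
    Entry nextToLast = null;
    int chainLength = 0;
    for (Entry e = first; e != null; e = e.next) {
      // compare the cached hash first to skip most equals() calls
      if (e.hash == h && e.str.equals(s)) {
        return e.str;                // the canonical instance
      }
      chainLength++;
      if (e.next != null) {
        nextToLast = e;
      }
    }

    // miss: intern via the JVM (keeps String.intern() compatibility),
    // then insert at the head of the bucket
    s = s.intern();
    cache[slot] = new Entry(s, h, first);
    if (chainLength >= maxChainLength && nextToLast != null) {
      nextToLast.next = null;        // evict the oldest entry in the bucket
    }
    return s;
  }
}
```

          Because a miss still falls through to String.intern(), equal strings always resolve to the same canonical instance even after an eviction.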

          Shalin Shekhar Mangar added a comment -

          Yonik, the string is being interned twice in your latest patch

          +      if (e==null) {
          +        s = s.intern();
          +        arr[slot] = new Entry(s.intern(), h, first);
          
          Yonik Seeley added a comment -

          Yonik, the string is being interned twice in your latest patch

          Thanks - I had actually fixed that... but it didn't make it into the patch apparently

          Earwin Burrfoot added a comment -

          Bug in previous algo (unbounded hash):

          // check cached hashCode first, and use compareTo to avoid
          // dynamic type checking in equals().
          if (h == other.hashCode() && s.compareTo(other)==0) {
            return s; // <--- here we should return 'other'
          }
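          The fix can be illustrated on a hypothetical (simplified, list-based) lookup loop: when an equal string is found, the instance already stored in the table must be returned, not the argument, otherwise two callers interning equal strings can end up holding different instances and identity comparisons break.

```java
import java.util.List;

// Hypothetical lookup illustrating the fix: return the stored canonical
// instance ('other'), never the incoming argument ('s').
class BucketLookup {
  static String lookup(String s, List<String> bucket) {
    int h = s.hashCode();
    for (String other : bucket) {
      // check cached hashCode first, and use compareTo to avoid
      // dynamic type checking in equals()
      if (h == other.hashCode() && s.compareTo(other) == 0) {
        return other;   // returning 's' here was the bug
      }
    }
    s = s.intern();
    bucket.add(s);
    return s;
  }
}
```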
          Yonik Seeley added a comment -

          Here's a slightly updated patch (javadoc and StringHelper.intern() convenience method).

          The method of plugging a different StringInterner implementation would be to simply assign to the static in StringHelper.

            /**
             * Expert:
             * The StringInterner implementation used by Lucene.
             * This should never be changed after other Lucene APIs have been used.
             */
            public static StringInterner interner = new SimpleStringInterner(1024,8);
          

          Are people comfortable with this approach? How about the defaults? This should speed up anyone with fewer than thousands of fields, and only slightly slow down someone with more than thousands of fields (until they plug in a new interner with a bigger table size).
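          Plugging in a different implementation would then be a one-line assignment at startup. A rough, self-contained illustration (the stand-in classes here are simplified stubs for the purpose of the example, not the patch's actual code):

```java
// Simplified stand-ins mirroring the API sketched in the patch; in the
// real patch these would live in org.apache.lucene.util.
interface StringInterner {
  String intern(String s);
}

class SimpleStringInterner implements StringInterner {
  // trivial stub: delegates to the JVM; the patch's version adds a cache
  public SimpleStringInterner(int tableSize, int maxChainLength) {}
  public String intern(String s) { return s.intern(); }
}

class StringHelper {
  /**
   * Expert: the StringInterner implementation used by Lucene.
   * This should never be changed after other Lucene APIs have been used.
   */
  public static StringInterner interner = new SimpleStringInterner(1024, 8);

  /** Convenience method that delegates to the current interner. */
  public static String intern(String s) {
    return interner.intern(s);
  }
}
```

          Swapping the implementation is then just `StringHelper.interner = new SimpleStringInterner(8192, 8);` executed before any other Lucene APIs are touched.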

          Michael McCandless added a comment -

          Yonik is this ready to go in...?

          Yonik Seeley added a comment -

          I think so... but I was waiting for some kind of feedback if people in general thought it was the right approach. It introduces another static, and people tend to not like that. I accidentally didn't upload the latest version with the javadoc + helper method. I'll do that now.

          Yonik Seeley added a comment -

          latest patch - could use a multi-threaded testcase to ensure no exceptions are thrown and that intern() always returns the same instance.

          Earwin Burrfoot added a comment -

          but I was waiting for some kind of feedback if people in general thought it was the right approach. It introduces another static, and people tend to not like that.

          Just forgot somehow about this issue.
          You're right about static, it's not clear how and when to initialize it, plus you introduce some public classes we'll be unable to change/remove later.
          I still have a feeling we should expose a single static method - intern() and hide implementation away, possibly tuning it to be advantageous for <thousands of fields, and degrading to raw String.intern() level if there are more fields.

          I'm going to be away from AC power for three days starting now, so I won't be able to reply until then.

          Yonik Seeley added a comment -

          Latest patch attached with multi-threaded test.

          Some quick performance tests:
          10240 unique strings, 100 threads: 29% faster (larger than the cache capacity - will lead to evictions)
          100 unique strings, 100 threads: 147% faster
          100 unique strings, 2 threads: 152% faster

          Times were simply the time it took the junit test to run - includes other overhead like random number generation and error checking.

          Yonik Seeley added a comment -

          I still have a feeling we should expose a single static method - intern() and hide implementation away

          While the default should be beneficial for most users, I'd hate to lock away a user's ability to either expand or remove the cache.

          Mark Miller added a comment -

          I assume we can assign this one to you Yonik?

          Yonik Seeley added a comment -

          I've held off because of a lack of consensus, but I suppose I can do the old "i'll commit in a few days" thing.

          Mark Miller added a comment -

          Or push to 3.1. I have no preference if it goes in as is, just looking for it to be managed one way or another for 2.9 release.

          Michael McCandless added a comment -

          I've held off because of a lack of consensus

          Silence = consensus!

          Earwin Burrfoot added a comment -

          Okay, let's have an extra class and ability to switch impls. I liked that static method could get inlined (at least its short-path), but that's not necessary.

          Except I'd like the javadoc to demand that each impl be String.intern()-compatible. There's nothing bad in it, as in any decent impl a unique string will be String.intern()'ed one time at most. And the case when you get an infinite flow of unique strings is degenerate anyway; you have to fix something, not deal with it. On the other hand, we can remove the "This should never be changed after other Lucene APIs have been used" clause.

          rewrite 'for' as 'for (Entry e = first;e != null;e = e.next)' for clarity?
          'Entry[] arr = cache;' - this can be skipped? 'cache' is already final and optimizer loves finals. Plus further down the method you use both cache[slot] and arr[slot]. Or am I missing some voodoo?
          If check around 'nextToLast = e' can also be removed?
          'public String intern(char[] arr, int offset, int len)' - is this needed?

          Mark Miller added a comment -

          looks like this is so close ...

          Yonik Seeley added a comment -

          Except I'd like the javadoc demand each impl to be String.intern()-compatible.

          If everything went through the same intern, it wouldn't matter.

          rewrite 'for' as 'for (Entry e = first;e != null;e = e.next)' for clarity?

          done.

          If check around 'nextToLast = e' can also be removed?

          I don't see how...

          Yonik Seeley added a comment -

          committed.

          Uwe Schindler added a comment -

          I found one real occurrence of intern() in contrib and a now obsolete discussion in one comment.

          Attached is a simple patch.

          What I currently do not like is that there are still a lot of intern()s in the Javadocs/comments elsewhere; maybe this should also be changed (code refactoring is not able to do that, but a simple grep is).

          Uwe Schindler added a comment -

          Committed rev 802095.

          Noble Paul added a comment -

          isn't it possible to make the call to

          public String intern(char[] arr, int offset, int len) {
          }
          

          not create a String?
          Currently, if I have a char[], it ends up creating a String (which makes a copy of the char[]) before it does the intern() operation.
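          One way the char[] variant could avoid that allocation (a hypothetical sketch, not the attached patch): hash the range with the same function as String.hashCode(), compare candidates character by character, and create a String only on a cache miss.

```java
// Hypothetical sketch of a char[]-range intern that allocates a String
// only on a cache miss (single-entry buckets, for brevity).
class CharRangeInterner {
  private final String[] cache;

  CharRangeInterner(int tableSize) {   // tableSize must be a power of two
    cache = new String[tableSize];
  }

  public String intern(char[] arr, int offset, int len) {
    // same hash function as String.hashCode(), applied to the range
    int h = 0;
    for (int i = offset; i < offset + len; i++) {
      h = 31 * h + arr[i];
    }
    int slot = h & (cache.length - 1);

    String cached = cache[slot];
    if (cached != null && equalsRange(cached, arr, offset, len)) {
      return cached;                   // hit: no String allocated
    }

    // miss: copy the chars once, intern, and remember the result
    String s = new String(arr, offset, len).intern();
    cache[slot] = s;
    return s;
  }

  private static boolean equalsRange(String s, char[] arr, int offset, int len) {
    if (s.length() != len) return false;
    for (int i = 0; i < len; i++) {
      if (s.charAt(i) != arr[offset + i]) return false;
    }
    return true;
  }
}
```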

          Noble Paul added a comment -

          I haven't quite tested it, but it shows the idea as a patch.

          Yonik Seeley added a comment -

          isn't it possible to make the call to
          public String intern(char[] arr, int offset, int len) {}
          not create a String?

          Yep - that's why I added that method (just didn't get around to implementing it that way).


            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Earwin Burrfoot
            • Votes:
              1
              Watchers:
              2
