Lucene - Core
  1. Lucene - Core
  2. LUCENE-2216

OpenBitSet#hashCode() may return false for identical sets.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9, 2.9.1, 3.0
    • Fix Version/s: 2.9.4, 3.0.3
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      OpenBitSet uses an internal buffer of long variables to store set bits and an additional 'wlen' index that points
      to the highest used component inside

      {@link #bits}

      buffer.

      Unlike in JDK, the wlen field is not continuously maintained (on clearing bits, for example). This leads to a situation when wlen may point
      far beyond the last set bit.

      The hashCode implementation iterates over all long components of the bits buffer, rotating the hash even for empty components. This is against the contract of hashCode-equals. The following test case illustrates this:

      // initialize two bitsets with different capacity (bits length).
      BitSet bs1 = new BitSet(200);
      BitSet bs2 = new BitSet(64);
      // set the same bit.
      bs1.set(3);
      bs2.set(3);
              
      // equals returns true (passes).
      assertEquals(bs1, bs2);
      // hashCode returns false (against contract).
      assertEquals(bs1.hashCode(), bs2.hashCode());
      

      Fix and test case attached.

      1. LUCENE-2216.patch
        0.9 kB
        Yonik Seeley
      2. openbitset.patch
        2 kB
        Dawid Weiss

        Activity

        Hide
        Yonik Seeley added a comment -

        Thanks Dawid!

        hashCode and equals probably shouldn't be modifying the state of the object though, right?
        It's also not thread safe, so a lot of weird things could happen... the simplest example is that two threads could check that the last word is all zeros and both decrement wlen.

        I like the spirit of your change though, as it only adds to the cost of hashCode/equals (which are already very expensive with large bitsets and should be avoided if possible anyway).

        Show
        Yonik Seeley added a comment - Thanks Dawid! hashCode and equals probably shouldn't be modifying the state of the object though, right? It's also not thread safe, so a lot of weird things could happen... the simplest example is that two threads could check that the last word is all zeros and both decrement wlen. I like the spirit of your change though, as it only adds to the cost of hashCode/equals (which are already very expensive with large bitsets and should be avoided if possible anyway).
        Hide
        Yonik Seeley added a comment -

        I haven't tested this patch, but this seems like a simple solution. Start with a zero hashcode while iterating backward and the trailing zeros won't affect the hashcode.

        Show
        Yonik Seeley added a comment - I haven't tested this patch, but this seems like a simple solution. Start with a zero hashcode while iterating backward and the trailing zeros won't affect the hashcode.
        Hide
        Dawid Weiss added a comment -

        Hi Yonik,

        This class is not thread-safe anyway (there are no memory barriers of any kind anywhere in the code).

        From a single-thread perspective, yes, you are modifying the internal state of this object, but it's not really affecting anything other than possibly speeding up further interaction with this object (any other operation no OpenBitSets is affected by the value inside wlen).

        Your patch also solves the issue, of course. I just don't see the point in not updating wlen since you're scanning through memory anyway... The implementation of OpenBitSet is different in this regard to java.util.BitSet, which always maintains the last non-empty index. I've been thinking about it a bit and there are pros and cons to both implementations, but lazily moving wlen when memory is scanned anyway seems like a better alternative than keeping wlen unnecessarily large (which affects ORs, ANDs and other set operations).

        To me this implementation cannot be used in a multi-threaded application anyway, am I wrong here?

        D.

        Show
        Dawid Weiss added a comment - Hi Yonik, This class is not thread-safe anyway (there are no memory barriers of any kind anywhere in the code). From a single-thread perspective, yes, you are modifying the internal state of this object, but it's not really affecting anything other than possibly speeding up further interaction with this object (any other operation no OpenBitSets is affected by the value inside wlen). Your patch also solves the issue, of course. I just don't see the point in not updating wlen since you're scanning through memory anyway... The implementation of OpenBitSet is different in this regard to java.util.BitSet, which always maintains the last non-empty index. I've been thinking about it a bit and there are pros and cons to both implementations, but lazily moving wlen when memory is scanned anyway seems like a better alternative than keeping wlen unnecessarily large (which affects ORs, ANDs and other set operations). To me this implementation cannot be used in a multi-threaded application anyway, am I wrong here? D.
        Hide
        Dawid Weiss added a comment -

        Perhaps this is for another patch, but BitUtil contains several bit-counting methods (pop, ntz) that have been implemented in the JDK in the same way (Hacker's Delight) and will come with HotSpot intrinsics for the new Intels (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6823354). On the other hand, Lucene's implementation may be useful for folks with older VMs...

        Show
        Dawid Weiss added a comment - Perhaps this is for another patch, but BitUtil contains several bit-counting methods (pop, ntz) that have been implemented in the JDK in the same way (Hacker's Delight) and will come with HotSpot intrinsics for the new Intels ( http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6823354 ). On the other hand, Lucene's implementation may be useful for folks with older VMs...
        Hide
        Yonik Seeley added a comment -

        To me this implementation cannot be used in a multi-threaded application anyway, am I wrong here?

        Pretty much any mutable object may be safely shared with other threads after it's done being modified. So one thread could create, and many threads could read. I don't know how explicitly it's spelled out in Java, but hashCode and equals shouldn't modify the object's state in any meaningful way.

        Show
        Yonik Seeley added a comment - To me this implementation cannot be used in a multi-threaded application anyway, am I wrong here? Pretty much any mutable object may be safely shared with other threads after it's done being modified. So one thread could create, and many threads could read. I don't know how explicitly it's spelled out in Java, but hashCode and equals shouldn't modify the object's state in any meaningful way.
        Hide
        Dawid Weiss added a comment -

        This is only true if there is happens-before between the reads and the modifications to the object. In any other case other threads may be reading stale values (i.e., from their own cache), at least if my understanding of the jmm is correct here. Whether you want to rely on such a deep semantics of interaction between threads is something to consider deeply, at least in my personal opinion.

        Show
        Dawid Weiss added a comment - This is only true if there is happens-before between the reads and the modifications to the object. In any other case other threads may be reading stale values (i.e., from their own cache), at least if my understanding of the jmm is correct here. Whether you want to rely on such a deep semantics of interaction between threads is something to consider deeply, at least in my personal opinion.
        Hide
        Dawid Weiss added a comment - - edited

        This is not entirely what I had in mind (it's not cache, but HotSpot optimisation), but similar situation applies (the value of the field that's never modified from the perspective of the current thread is never re-read).

        public class Example10 {
            private static Holder holder;
        
            public static void startThread() {
                new Thread() {
                    public void run() {
                        try { sleep(2000); } catch (Exception e) { /* ignore */ }
                        holder.ready = true;
                        System.out.println("Setting ready to true.");
                    }
                }.start();
            }
        
            public static void main(String [] args) {
                holder = new Holder();
                startThread();
                while (!holder.ready) {
                    // Do nothing.
                }
                System.out.println("I'm ready.");
            }
        }
        
        class Holder {
            public boolean ready;
        }
        

        If you run it with -server, it will (should... or does on two machines I own) deadlock. Client mode and interpreted mode are not optimized, so it passes.

        Show
        Dawid Weiss added a comment - - edited This is not entirely what I had in mind (it's not cache, but HotSpot optimisation), but similar situation applies (the value of the field that's never modified from the perspective of the current thread is never re-read). public class Example10 { private static Holder holder; public static void startThread() { new Thread () { public void run() { try { sleep(2000); } catch (Exception e) { /* ignore */ } holder.ready = true ; System .out.println( "Setting ready to true ." ); } }.start(); } public static void main( String [] args) { holder = new Holder(); startThread(); while (!holder.ready) { // Do nothing. } System .out.println( "I'm ready." ); } } class Holder { public boolean ready; } If you run it with -server, it will (should... or does on two machines I own) deadlock. Client mode and interpreted mode are not optimized, so it passes.
        Hide
        Yonik Seeley added a comment -

        This is only true if there is happens-before between the reads and the modifications to the object.

        Of course... I said "may be safely shared', not that any method one chooses to share it is correct.
        It still seems that promoting hashCode and equals to mutating operations is wrong, no?

        Show
        Yonik Seeley added a comment - This is only true if there is happens-before between the reads and the modifications to the object. Of course... I said "may be safely shared', not that any method one chooses to share it is correct. It still seems that promoting hashCode and equals to mutating operations is wrong, no?
        Hide
        Dawid Weiss added a comment -

        uff, I started having doubts in my own understanding, thanks for being patient with me.

        I agree that having hashCode mutate the object's state is weird. I had some thoughts about it – this particular mutation seems to be "safe" even from multi-threaded point of view. If another thread sees a stale value of wlen, then the only thing that is going to happen is it will scan more memory; for ands, ors and other types of operations this will have no effect. So assuming hashCode/equals is the ONLY method you're calling concurrently, it shouldn't break things. A similar kind of trickery goes on in String#hashCode (caching to a non-volatile field), although that object is immutable, so it's a slightly different scenario.

        To be honest, my preference for this would be to either maintain the wlen field during all operations (like java.util.BitSet) or at least to clearly state (JavaDoc?) that trimTrailingZeros() should be invoked prior to publishing the object for other threads for increased performance (in case you fiddle with bits and clear the tail). In the second options, your patch does a fine job of not mutating the object and correcting the bug.

        Thanks for an interesting discussion.

        Show
        Dawid Weiss added a comment - uff, I started having doubts in my own understanding, thanks for being patient with me. I agree that having hashCode mutate the object's state is weird. I had some thoughts about it – this particular mutation seems to be "safe" even from multi-threaded point of view. If another thread sees a stale value of wlen, then the only thing that is going to happen is it will scan more memory; for ands, ors and other types of operations this will have no effect. So assuming hashCode/equals is the ONLY method you're calling concurrently, it shouldn't break things. A similar kind of trickery goes on in String#hashCode (caching to a non-volatile field), although that object is immutable, so it's a slightly different scenario. To be honest, my preference for this would be to either maintain the wlen field during all operations (like java.util.BitSet) or at least to clearly state (JavaDoc?) that trimTrailingZeros() should be invoked prior to publishing the object for other threads for increased performance (in case you fiddle with bits and clear the tail). In the second options, your patch does a fine job of not mutating the object and correcting the bug. Thanks for an interesting discussion.
        Hide
        Yonik Seeley added a comment - - edited

        I agree that having hashCode mutate the object's state is weird. I had some thoughts about it - this particular mutation seems to be "safe" even from multi-threaded point of view. If another thread sees a stale value of wlen, then the only thing that is going to happen is it will scan more memory;

        There are still quite a few things that can go wrong I think. If all threads only called hashCode and equals, then you might be right... it's very specific to the implementation of trimTrailingZeros()

           public void trimTrailingZeros() {
            int idx = wlen-1;
            while (idx>=0 && bits[idx]==0) idx--;
            wlen = idx+1;
          }
        

        What could make that work is the fact that wlen is an integer, is never directly used as the loop counter, or as an index into the array.

        But the other big questions: are other read operations tolerant of wlen changing out from under them? My guess would be no.
        Look at xorCount for example:

            if (a.wlen < b.wlen) {
              tot += BitUtil.pop_array(b.bits, a.wlen, b.wlen-a.wlen);
        

        hashCode and equals changing wlen could cause a negative value to be passed to pop_array.

        edit: deleted second example, which isn't different from the first (the issue is safety with other read ops).

        Show
        Yonik Seeley added a comment - - edited I agree that having hashCode mutate the object's state is weird. I had some thoughts about it - this particular mutation seems to be "safe" even from multi-threaded point of view. If another thread sees a stale value of wlen, then the only thing that is going to happen is it will scan more memory; There are still quite a few things that can go wrong I think. If all threads only called hashCode and equals, then you might be right... it's very specific to the implementation of trimTrailingZeros() public void trimTrailingZeros() { int idx = wlen-1; while (idx>=0 && bits[idx]==0) idx--; wlen = idx+1; } What could make that work is the fact that wlen is an integer, is never directly used as the loop counter, or as an index into the array. But the other big questions: are other read operations tolerant of wlen changing out from under them? My guess would be no. Look at xorCount for example: if (a.wlen < b.wlen) { tot += BitUtil.pop_array(b.bits, a.wlen, b.wlen-a.wlen); hashCode and equals changing wlen could cause a negative value to be passed to pop_array. edit: deleted second example, which isn't different from the first (the issue is safety with other read ops).
        Hide
        Dawid Weiss added a comment -

        Chances of this happening are really slim (this would probably be a single inlined read as soon as the compilation takes place, but you're right in the general case. I am not arguing changing the object in hashCode is good – my argument is that ideally it should be fixed elsewhere (as in my previous suggestion – either updating wlen every time the tail changes, or make explicit changes to the documentation that inform about suboptimal performance for zero-tailed sets).

        Show
        Dawid Weiss added a comment - Chances of this happening are really slim (this would probably be a single inlined read as soon as the compilation takes place, but you're right in the general case. I am not arguing changing the object in hashCode is good – my argument is that ideally it should be fixed elsewhere (as in my previous suggestion – either updating wlen every time the tail changes, or make explicit changes to the documentation that inform about suboptimal performance for zero-tailed sets).
        Hide
        Dawid Weiss added a comment -

        For what it's worth, I checked the mentioned BitUtil methods – ntz/pop; the same implementation is included from Java 1.5 upward. Do you want me to file another patch for this, Yonik, or are we leaving this as-is? I'd redirect from BitUtil to Long/Integer, deprecate BitUtil methods and replace the places in the code where they are used.

        Show
        Dawid Weiss added a comment - For what it's worth, I checked the mentioned BitUtil methods – ntz/pop; the same implementation is included from Java 1.5 upward. Do you want me to file another patch for this, Yonik, or are we leaving this as-is? I'd redirect from BitUtil to Long/Integer, deprecate BitUtil methods and replace the places in the code where they are used.
        Hide
        Yonik Seeley added a comment -

        my argument is that ideally it should be fixed elsewhere

        This is an expert-level class... I don't think that every call to clear() should be checking if it completely cleared the last word. It's easy enough to call trimTrailingZeros after you did a bunch of modifications... but not so easy to regain the lost performance for the code doing redundant checking you didn't want.

        Show
        Yonik Seeley added a comment - my argument is that ideally it should be fixed elsewhere This is an expert-level class... I don't think that every call to clear() should be checking if it completely cleared the last word. It's easy enough to call trimTrailingZeros after you did a bunch of modifications... but not so easy to regain the lost performance for the code doing redundant checking you didn't want.
        Hide
        Dawid Weiss added a comment -

        Ok, argument accepted.

        Show
        Dawid Weiss added a comment - Ok, argument accepted.
        Hide
        Yonik Seeley added a comment -

        For what it's worth, I checked the mentioned BitUtil methods - ntz/pop; the same implementation is included from Java 1.5 upward.

        Huh - I didn't realize that Java5 had the same pop impl as I did... it will be cool if it finally starts using native POPCNT instructions.

        As far as ntz, I went though a lot of micro-optimizations and different implementations before I settled on the one used in BitUtil, so it would be nice to do some benchmarks to see if it's truly faster now (and also what the performance difference is for users of JVMs before this optimization was implemented).

        Show
        Yonik Seeley added a comment - For what it's worth, I checked the mentioned BitUtil methods - ntz/pop; the same implementation is included from Java 1.5 upward. Huh - I didn't realize that Java5 had the same pop impl as I did... it will be cool if it finally starts using native POPCNT instructions. As far as ntz, I went though a lot of micro-optimizations and different implementations before I settled on the one used in BitUtil, so it would be nice to do some benchmarks to see if it's truly faster now (and also what the performance difference is for users of JVMs before this optimization was implemented).
        Hide
        Dawid Weiss added a comment -

        Ah, ok – I thought ntz in BitUtils is the same as in hacker's delight, but it isn't. Microbenchmarks will always be misleading as they depend a lot on how you test, but I can do it out of sheer curiosity – will report tomorrow.

        Show
        Dawid Weiss added a comment - Ah, ok – I thought ntz in BitUtils is the same as in hacker's delight, but it isn't. Microbenchmarks will always be misleading as they depend a lot on how you test, but I can do it out of sheer curiosity – will report tomorrow.
        Hide
        Yonik Seeley added a comment -

        Microbenchmarks will always be misleading as they depend a lot on how you test, but I can do it out of sheer curiosity - will report tomorrow.

        Cool. I'd recommend testing in the context of OpenBitSet (i.e. don't try testing ntz directly).
        Perhaps just create a large random set (~1M bits) with a certain percent of bits set, and then iterate over those set bits.

        Show
        Yonik Seeley added a comment - Microbenchmarks will always be misleading as they depend a lot on how you test, but I can do it out of sheer curiosity - will report tomorrow. Cool. I'd recommend testing in the context of OpenBitSet (i.e. don't try testing ntz directly). Perhaps just create a large random set (~1M bits) with a certain percent of bits set, and then iterate over those set bits.
        Hide
        Yonik Seeley added a comment -

        Committed in trunk. Thanks for bringing this up!

        Show
        Yonik Seeley added a comment - Committed in trunk. Thanks for bringing this up!
        Hide
        Robert Muir added a comment -

        reopening for possible 2.9.4/3.0.3 backport.

        Show
        Robert Muir added a comment - reopening for possible 2.9.4/3.0.3 backport.
        Hide
        Uwe Schindler added a comment -

        Backported 3.0 revision: 1038096
        Backported 2.9 revision: 1038098

        Show
        Uwe Schindler added a comment - Backported 3.0 revision: 1038096 Backported 2.9 revision: 1038098

          People

          • Assignee:
            Unassigned
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development