Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Find a previous set bit in an OpenBitSet.
      Useful for parent testing in nested document query execution LUCENE-2454 .

      1. LUCENE-3179.patch
        8 kB
        Paul Elschot
      2. TestBitUtil.java
        4 kB
        Paul Elschot
      3. LUCENE-3179.patch
        10 kB
        Paul Elschot
      4. LUCENE-3179.patch
        11 kB
        Michael McCandless
      5. TestOpenBitSet.patch
        0.5 kB
        Uwe Schindler
      6. LUCENE-3179-fix.patch
        1 kB
        Uwe Schindler
      7. LUCENE-3179-fix.patch
        2 kB
        Uwe Schindler
      8. LUCENE-3179-long-ntz.patch
        2 kB
        Uwe Schindler
      9. LUCENE-3179-long-ntz.patch
        6 kB
        Uwe Schindler

        Activity

        Hide
        Paul Elschot added a comment -

        Add prevSetBit() and tests. Also moves some test code from TestOpenBitSet to TestBitUtil.

        Show
        Paul Elschot added a comment - Add prevSetBit() and tests. Also moves some test code from TestOpenBitSet to TestBitUtil.
        Hide
        Yonik Seeley added a comment -

        Hey Paul, did you try this implementation against Long.numberOfLeadingZeros?
        The later Oracle Java6 implementations have instrinsified this method, so it might be faster:
        http://bugs.sun.com/view_bug.do?bug_id=6823354

        Show
        Yonik Seeley added a comment - Hey Paul, did you try this implementation against Long.numberOfLeadingZeros? The later Oracle Java6 implementations have instrinsified this method, so it might be faster: http://bugs.sun.com/view_bug.do?bug_id=6823354
        Hide
        Paul Elschot added a comment -

        Correct the issue number in the patch, and remove a superfluous javadoc comment.

        Show
        Paul Elschot added a comment - Correct the issue number in the patch, and remove a superfluous javadoc comment.
        Hide
        Paul Elschot added a comment -

        Correct the issue number in the patch, remove a superfluous javadoc comment, and grant licence ...

        Show
        Paul Elschot added a comment - Correct the issue number in the patch, remove a superfluous javadoc comment, and grant licence ...
        Hide
        Paul Elschot added a comment -

        I did not try this against Long.numberOfLeadingZeros, but in case that is faster we should use that of course.

        Show
        Paul Elschot added a comment - I did not try this against Long.numberOfLeadingZeros, but in case that is faster we should use that of course.
        Hide
        Uwe Schindler added a comment -

        If it's faster, should we not replace it completely in Lucene? The impl in Java 5 (Sun JDK) is identical to ours from BitUtils, so why replicate? If it gets intrinsic, it can only get faster. I assume its a relict from pre-Java-1.5 times like Lucene 2.9.

        Show
        Uwe Schindler added a comment - If it's faster, should we not replace it completely in Lucene? The impl in Java 5 (Sun JDK) is identical to ours from BitUtils, so why replicate? If it gets intrinsic, it can only get faster. I assume its a relict from pre-Java-1.5 times like Lucene 2.9.
        Hide
        Uwe Schindler added a comment -

        With the previous comment I also refer to nextSetBit().

        Show
        Uwe Schindler added a comment - With the previous comment I also refer to nextSetBit().
        Hide
        Dawid Weiss added a comment -

        I posted the benchmarks of intrinsic vs. manual (OpenBitSet) performance of nlz and pop (bitcount) methods a while ago – they should still be around JIRA somewhere. If I recall right, the difference was significant, although not like an order of magnitude or something... and on CPUs without intrinsic instructions the implementation handcrafted by Yonik was actually faster than the one in the standard library. Of course these days most CPUs will have popcnt/ nlz instructions, so it makes sense to switch.

        Show
        Dawid Weiss added a comment - I posted the benchmarks of intrinsic vs. manual (OpenBitSet) performance of nlz and pop (bitcount) methods a while ago – they should still be around JIRA somewhere. If I recall right, the difference was significant, although not like an order of magnitude or something... and on CPUs without intrinsic instructions the implementation handcrafted by Yonik was actually faster than the one in the standard library. Of course these days most CPUs will have popcnt/ nlz instructions, so it makes sense to switch.
        Hide
        Dawid Weiss added a comment -

        I think it's the 1.6 that adds these intrinsics – I don't know if they've been backported to updates to 1.5, but this should be relatively easy to verify empirically.

        Show
        Dawid Weiss added a comment - I think it's the 1.6 that adds these intrinsics – I don't know if they've been backported to updates to 1.5, but this should be relatively easy to verify empirically.
        Hide
        Uwe Schindler added a comment -

        That's strange, the last time I looked into the code from the src.jar standard library it was 1:1 identical (there was even the same reference to the same Hacker's delight article). So I am just confused...

        Show
        Uwe Schindler added a comment - That's strange, the last time I looked into the code from the src.jar standard library it was 1:1 identical (there was even the same reference to the same Hacker's delight article). So I am just confused...
        Hide
        Dawid Weiss added a comment -

        Intrinsics are implemented/added at the hotspot (jit) level, you won't see them in src.jar – all calls to specific methods in Long.* or Integer.* are replaced by handcrafted assembly (usually process-specific instructions that do what a given method should do).

        If you're interested, check out openjdk code of hotspot and scan for intrinsics (or popcnt).

        Show
        Dawid Weiss added a comment - Intrinsics are implemented/added at the hotspot (jit) level, you won't see them in src.jar – all calls to specific methods in Long.* or Integer.* are replaced by handcrafted assembly (usually process-specific instructions that do what a given method should do). If you're interested, check out openjdk code of hotspot and scan for intrinsics (or popcnt).
        Hide
        Uwe Schindler added a comment -

        You misunderstood me, i know what intrinsics are. My confusion was related to that:

        and on CPUs without intrinsic instructions the implementation handcrafted by Yonik was actually faster than the one in the standard library

        And the so called hand crafted method is identical in src.jar and Yonik's code. So without intrinsics, the standard library and Yoniks code should be identical in performance, as it was same code, the last time I looked into it.

        Show
        Uwe Schindler added a comment - You misunderstood me, i know what intrinsics are. My confusion was related to that: and on CPUs without intrinsic instructions the implementation handcrafted by Yonik was actually faster than the one in the standard library And the so called hand crafted method is identical in src.jar and Yonik's code. So without intrinsics, the standard library and Yoniks code should be identical in performance, as it was same code, the last time I looked into it.
        Hide
        Yonik Seeley added a comment -

        And the so called hand crafted method is identical in src.jar and Yonik's code.

        For pop, yes. But not for ntz or pop_array and friends.

        BitUtil.pop exists because this was originally written to work with java1.4 which didn't have Long.bitCount()
        http://markmail.org/message/5ay4m2thsvsahk3c

        Show
        Yonik Seeley added a comment - And the so called hand crafted method is identical in src.jar and Yonik's code. For pop, yes. But not for ntz or pop_array and friends. BitUtil.pop exists because this was originally written to work with java1.4 which didn't have Long.bitCount() http://markmail.org/message/5ay4m2thsvsahk3c
        Hide
        Paul Elschot added a comment -

        The micro benchmarks for ntz() and pop() are at LUCENE-2221

        Show
        Paul Elschot added a comment - The micro benchmarks for ntz() and pop() are at LUCENE-2221
        Hide
        Dawid Weiss added a comment -

        Oh, ok – clear. So, my comment was related to the various methods of doing bitcounts and other bit-fiddling on arrays of long values (for example pop_array) – these are HD derived implementations; I compared them to naive loops using intrinsics and naive loops on cpus (and jvms) without intrinsics – in that case simple loops with intrinsics was faster than Lucene's code, but Lucene's code was faster than simple loops without intrinsics (effectively using whatever was in the std. library).

        Show
        Dawid Weiss added a comment - Oh, ok – clear. So, my comment was related to the various methods of doing bitcounts and other bit-fiddling on arrays of long values (for example pop_array) – these are HD derived implementations; I compared them to naive loops using intrinsics and naive loops on cpus (and jvms) without intrinsics – in that case simple loops with intrinsics was faster than Lucene's code, but Lucene's code was faster than simple loops without intrinsics (effectively using whatever was in the std. library).
        Hide
        Uwe Schindler added a comment -

        OK, so we can sefely remove BitUtil.pop and replace by the Java 5 method (maybe review again the code in src.jar also for ntz). And if this one is an intrinsic in Java 6 its even faster.

        Now we talk the same language

        Show
        Uwe Schindler added a comment - OK, so we can sefely remove BitUtil.pop and replace by the Java 5 method (maybe review again the code in src.jar also for ntz). And if this one is an intrinsic in Java 6 its even faster. Now we talk the same language
        Hide
        Paul Elschot added a comment -

        As to the performance, the current patch at LUCENE-2454 has a bitwise linear search to do this.

        Show
        Paul Elschot added a comment - As to the performance, the current patch at LUCENE-2454 has a bitwise linear search to do this.
        Hide
        Paul Elschot added a comment -

        I did a bit of performance testing (sun java 1.6.0_xx, not the very latest one).

        This is a typical output on my machine (the dummy can be ignored, it is only there to make sure that nothing is optimized away):

        BitUtil nlz time: 5664 picosec/call, dummy: 11572915728
        Long    nlz time: 8464 picosec/call, dummy: 7715277152
        

        That means that the nlz code in the patch is definitely faster than Long.numberOfLeadingZeros for the test arguments used.
        The test arguments are divided roughly evenly for the possible numbers of leading zero bits.

        Show
        Paul Elschot added a comment - I did a bit of performance testing (sun java 1.6.0_xx, not the very latest one). This is a typical output on my machine (the dummy can be ignored, it is only there to make sure that nothing is optimized away): BitUtil nlz time: 5664 picosec/call, dummy: 11572915728 Long nlz time: 8464 picosec/call, dummy: 7715277152 That means that the nlz code in the patch is definitely faster than Long.numberOfLeadingZeros for the test arguments used. The test arguments are divided roughly evenly for the possible numbers of leading zero bits.
        Hide
        Paul Elschot added a comment - - edited

        TestBitUtil.java as in the patch and extended with a testPerfNlz method that gave the output above.

        Show
        Paul Elschot added a comment - - edited TestBitUtil.java as in the patch and extended with a testPerfNlz method that gave the output above.
        Hide
        Dawid Weiss added a comment -

        You're not providing the required contex – what exact JVM and what exact processor did you test on? I've just ran your test on my machine with the following result:

        BitUtil nlz time: 3109 picosec/call, dummy: 20252602524
        Long nlz time: 1279 picosec/call, dummy: 48220482200

        I'm guessing yours didn't use the intrinsic inline at all (for whatever reason). My machine is a fairly old Intel I7 860 running 64-Bit server hotspot 1.6.0_24-b07.

        Show
        Dawid Weiss added a comment - You're not providing the required contex – what exact JVM and what exact processor did you test on? I've just ran your test on my machine with the following result: BitUtil nlz time: 3109 picosec/call, dummy: 20252602524 Long nlz time: 1279 picosec/call, dummy: 48220482200 I'm guessing yours didn't use the intrinsic inline at all (for whatever reason). My machine is a fairly old Intel I7 860 running 64-Bit server hotspot 1.6.0_24-b07.
        Hide
        Paul Elschot added a comment -

        The java.vm.version value 1.6.0_03-b05, java.vm.info value is mixed mode.
        The processor is an Athlon II X3 450 at 800 MHz.

        Since the Long time is about 2.5 times faster than the BitUtil with a 64 bit processor, I'll change the patch to use Long. When the hardware allows better performance, it should be used.

        Show
        Paul Elschot added a comment - The java.vm.version value 1.6.0_03-b05, java.vm.info value is mixed mode. The processor is an Athlon II X3 450 at 800 MHz. Since the Long time is about 2.5 times faster than the BitUtil with a 64 bit processor, I'll change the patch to use Long. When the hardware allows better performance, it should be used.
        Hide
        Paul Elschot added a comment -

        BitUtil.nlz() and the performance test method (renamed to tstPerfNlz()) are still in the patch, even though they are not used.

        I think committing this could wait until LUCENE-2454 is committed, and then that code can be changed to use prevSetBit() together with this.

        Show
        Paul Elschot added a comment - BitUtil.nlz() and the performance test method (renamed to tstPerfNlz()) are still in the patch, even though they are not used. I think committing this could wait until LUCENE-2454 is committed, and then that code can be changed to use prevSetBit() together with this.
        Hide
        Paul Elschot added a comment -

        Corrected mixing up the digits of the issue number.

        Show
        Paul Elschot added a comment - Corrected mixing up the digits of the issue number.
        Hide
        Michael McCandless added a comment -

        I think we should just commit this? It's a useful API.

        LUCENE-3171 (alternative nested docs impl w/ single pass collector) also could use this.

        Show
        Michael McCandless added a comment - I think we should just commit this? It's a useful API. LUCENE-3171 (alternative nested docs impl w/ single pass collector) also could use this.
        Hide
        Michael McCandless added a comment -

        Patch, just fixes some whitespace issues and adds CHANGES entry. I think it's ready to commit!

        Show
        Michael McCandless added a comment - Patch, just fixes some whitespace issues and adds CHANGES entry. I think it's ready to commit!
        Hide
        Michael McCandless added a comment -

        Thanks Paul!

        Show
        Michael McCandless added a comment - Thanks Paul!
        Hide
        Uwe Schindler added a comment -

        The testcase for prevSetBit has a bug, that was found by testing with java 5. It assumes that the allocation strategy for BitSet and OpenBitSet is identical, which it is not. E.g. Java 5's new BitSet(0) allocates still one word, while OpenBitSet does not.

        The attached patch fixes the issue.

        Show
        Uwe Schindler added a comment - The testcase for prevSetBit has a bug, that was found by testing with java 5. It assumes that the allocation strategy for BitSet and OpenBitSet is identical, which it is not. E.g. Java 5's new BitSet(0) allocates still one word, while OpenBitSet does not. The attached patch fixes the issue.
        Hide
        Uwe Schindler added a comment -

        Yonik mentioned on mailing list that prevSetBit is broken for size==0 and also indexes >= size. In that case you always get AIOOBE or even wrong results. In the case of an index >= the length of the bitset, the scanning must start at the last possible bit, so subIndex must be 0x3f and not simply the anded bits.

        This is my naive fix. Tests pass (I added a extra check to the test that start beyond end of bitset to check prevSetBit).

        Show
        Uwe Schindler added a comment - Yonik mentioned on mailing list that prevSetBit is broken for size==0 and also indexes >= size. In that case you always get AIOOBE or even wrong results. In the case of an index >= the length of the bitset, the scanning must start at the last possible bit, so subIndex must be 0x3f and not simply the anded bits. This is my naive fix. Tests pass (I added a extra check to the test that start beyond end of bitset to check prevSetBit).
        Hide
        Uwe Schindler added a comment -

        The check for negative indexes must be done to make the following loop work (which is standard to iterate backwards from "startBit" on all bits):

        for (int i = bs.prevSetBit(startBit); i >= 0; i = bs.prevSetBit(i-1)) {
             // operate on index i here
        }
        

        This would fail with AIOOBE when i=0 on the last iteration (happens if 0th bit is set), because bs.prevSetBit(i-1) has negative parameter. The exit condition is checked later, so -1 must be allowed.

        Show
        Uwe Schindler added a comment - The check for negative indexes must be done to make the following loop work (which is standard to iterate backwards from "startBit" on all bits): for ( int i = bs.prevSetBit(startBit); i >= 0; i = bs.prevSetBit(i-1)) { // operate on index i here } This would fail with AIOOBE when i=0 on the last iteration (happens if 0th bit is set), because bs.prevSetBit(i-1) has negative parameter. The exit condition is checked later, so -1 must be allowed.
        Hide
        Uwe Schindler added a comment -

        Modified patch. I moved the assignment to the "word" variable to also inside the if/else branch, as for the beyond-last-bit case we can optimize to not shift at all.

        Show
        Uwe Schindler added a comment - Modified patch. I moved the assignment to the "word" variable to also inside the if/else branch, as for the beyond-last-bit case we can optimize to not shift at all.
        Hide
        Paul Elschot added a comment -

        The 3179-fix patch looks good to me.
        I remember I had some doubts about which bit was actually the last one, and stopped worrying about it when the tests passed.
        This patch makes it very clear what the last bit is.

        Show
        Paul Elschot added a comment - The 3179-fix patch looks good to me. I remember I had some doubts about which bit was actually the last one, and stopped worrying about it when the tests passed. This patch makes it very clear what the last bit is.
        Hide
        Yonik Seeley added a comment -

        +1, patch looks good Uwe.

        Show
        Yonik Seeley added a comment - +1, patch looks good Uwe.
        Hide
        Uwe Schindler added a comment -

        Committed 3.x branch revision: 1139430
        Committed trunk revision: 1139431
        Committed 3.3 branch revision: 1139433

        Thanks Yonik!

        Show
        Uwe Schindler added a comment - Committed 3.x branch revision: 1139430 Committed trunk revision: 1139431 Committed 3.3 branch revision: 1139433 Thanks Yonik!
        Hide
        Uwe Schindler added a comment -

        One more comment: When working on the code, the symmetry all other methods have between long and int is broken here. For consistency we should add the long method, too. I just don't like the missing consistency.

        Also: OpenBitSet.nextSetBit() does not use Long.numberOfTrailingZeroes() but the new prevSetBit() does. As both methods have intrinsics, why only use one of them? Yonik?

        Any comments?

        Show
        Uwe Schindler added a comment - One more comment: When working on the code, the symmetry all other methods have between long and int is broken here. For consistency we should add the long method, too. I just don't like the missing consistency. Also: OpenBitSet.nextSetBit() does not use Long.numberOfTrailingZeroes() but the new prevSetBit() does. As both methods have intrinsics, why only use one of them? Yonik? Any comments?
        Hide
        Michael McCandless added a comment -

        Thanks for fixing these Uwe!

        I actually don't like how "generic" OBS has become... ie, that all methods have an int and long version, that the OBS doesn't "know" how many bits it holds (I added this field recently, but only for assertions), that some methods "grow" the number of bits and others don't, some methods accept out-of-bounds indices (negative and > numBits), etc. I think it's grown to accommodate too many users.... but I'm not sure what we should do to fix this. Maybe factor out (yet another) bit set impl that doesn't grow, knows its number of bits, has these fast getNext/getPrev set bit methods, operates only on int indices, etc.

        Show
        Michael McCandless added a comment - Thanks for fixing these Uwe! I actually don't like how "generic" OBS has become... ie, that all methods have an int and long version, that the OBS doesn't "know" how many bits it holds (I added this field recently, but only for assertions), that some methods "grow" the number of bits and others don't, some methods accept out-of-bounds indices (negative and > numBits), etc. I think it's grown to accommodate too many users.... but I'm not sure what we should do to fix this. Maybe factor out (yet another) bit set impl that doesn't grow, knows its number of bits, has these fast getNext/getPrev set bit methods, operates only on int indices, etc.
        Hide
        Michael McCandless added a comment -

        One more comment: When working on the code, the symmetry all other methods have between long and int is broken here. For consistency we should add the long method, too. I just don't like the missing consistency.

        I think we should add the long version, for consistency.

        Also: OpenBitSet.nextSetBit() does not use Long.numberOfTrailingZeroes() but the new prevSetBit() does. As both methods have intrinsics, why only use one of them? Yonik?

        Good question! In testing on this issue, above, Dawid and Paul found the intrinsics were faster on modern JREs... seems like nextSetBit should cutover too?

        Show
        Michael McCandless added a comment - One more comment: When working on the code, the symmetry all other methods have between long and int is broken here. For consistency we should add the long method, too. I just don't like the missing consistency. I think we should add the long version, for consistency. Also: OpenBitSet.nextSetBit() does not use Long.numberOfTrailingZeroes() but the new prevSetBit() does. As both methods have intrinsics, why only use one of them? Yonik? Good question! In testing on this issue, above, Dawid and Paul found the intrinsics were faster on modern JREs... seems like nextSetBit should cutover too?
        Hide
        Uwe Schindler added a comment -

        Here the patch with the long version and Long.numberOfTrailingZeroes() instead of BitUtils.ntz().

        Path was already available on my checkout. We should only also test the long versions (according to Clover all of them are not really tested).

        Show
        Uwe Schindler added a comment - Here the patch with the long version and Long.numberOfTrailingZeroes() instead of BitUtils.ntz(). Path was already available on my checkout. We should only also test the long versions (according to Clover all of them are not really tested).
        Hide
        Uwe Schindler added a comment -

        New patch that also improves tests to check all uncovered long methods (of course the indexes are still < Integer.MAX_VALUE(.

        Show
        Uwe Schindler added a comment - New patch that also improves tests to check all uncovered long methods (of course the indexes are still < Integer.MAX_VALUE(.
        Hide
        Michael McCandless added a comment -

        Patch looks good Uwe – thanks!

        Show
        Michael McCandless added a comment - Patch looks good Uwe – thanks!
        Hide
        Uwe Schindler added a comment -

        Any other comments/microbenchmarks from other committers? Dawid and Paul?

        I would like to commit this if nobody objects! What should we do with the then obsolete BitUtils methods?

        Show
        Uwe Schindler added a comment - Any other comments/microbenchmarks from other committers? Dawid and Paul? I would like to commit this if nobody objects! What should we do with the then obsolete BitUtils methods?
        Hide
        Robert Muir added a comment -

        bulk close for 3.3

        Show
        Robert Muir added a comment - bulk close for 3.3
        Hide
        Uwe Schindler added a comment -

        Committed long versions and additional tests: rev 1143558 (trunk), rev 1143560 (3.x).

        I did not commit the cutover to Long.numberOfLeadingZeroes, because it was not performance tested. Also from the use-case, on machines without intrinsics, the JDK-given methods are slower (see comments in BitUtils.ntz, as in most cases the bits are shifted away (in nextSetBit), so the faster algorithm is to inverse the algorithm when calculating ntz.

        Show
        Uwe Schindler added a comment - Committed long versions and additional tests: rev 1143558 (trunk), rev 1143560 (3.x). I did not commit the cutover to Long.numberOfLeadingZeroes, because it was not performance tested. Also from the use-case, on machines without intrinsics, the JDK-given methods are slower (see comments in BitUtils.ntz, as in most cases the bits are shifted away (in nextSetBit), so the faster algorithm is to inverse the algorithm when calculating ntz.

          People

          • Assignee:
            Paul Elschot
            Reporter:
            Paul Elschot
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development