Lucene - Core
  1. Lucene - Core
  2. LUCENE-3654

Optimize BytesRef comparator to use Unsafe long based comparison (when possible)

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index, core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Inspire by Google Guava UnsignedBytes lexi comparator, that uses unsafe to do long based comparisons over the bytes instead of one by one (which yields 2-4x better perf), use similar logic in BytesRef comparator. The code was adapted to support offset/length.

        Activity

        Hide
        Robert Muir added a comment -

        Sorry, i'm totally against the change, even with safety checks.

        I think this will hurt the reputation of project, and i think it will be a nightmare for developers too (Sorry, i dont want to debug avoidable jvm crashes).
        And I don't want to see Lucene start using unsafe everywhere. This is lucene-java, things like bounds checking are part of the language.

        Show
        Robert Muir added a comment - Sorry, i'm totally against the change, even with safety checks. I think this will hurt the reputation of project, and i think it will be a nightmare for developers too (Sorry, i dont want to debug avoidable jvm crashes). And I don't want to see Lucene start using unsafe everywhere. This is lucene-java, things like bounds checking are part of the language.
        Hide
        Uwe Schindler added a comment -

        can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?

        We can put the whole comparator to contrib and BytesRef can have a static setter to change the default impl. Or we use SPI for it (contrib exports it in META-INF)

        Show
        Uwe Schindler added a comment - can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no? We can put the whole comparator to contrib and BytesRef can have a static setter to change the default impl. Or we use SPI for it (contrib exports it in META-INF)
        Hide
        Uwe Schindler added a comment - - edited

        The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods.

        EDIT
        You also have to copy offset, length and the actual byte[] reference to a local variable at the beginning and before the bounds checks (because otherwise another thread could change the public non-final fields in BytesRef and cause SIGSEGV). BytesRef is a user-visible class so it must be 100% safe against all usage-violations.

        Based on this additional overhead, the whole comparator makes no sense except for terms with a size of 200 bytes. But Lucene terms are in 99% of all cases shorter.

        If you want to use this comparator, just subclass Lucene40Codec and return it as term comparator, this can be completely outside Lucene. You can even use Guava.

        Show
        Uwe Schindler added a comment - - edited The SIGSEGV can be solved by doing some safety checks at the beginning of compare: check that offset>=0 and offset+length<=bytes.length. If you use Unsafe, you have to make sure that your parameters are 1000% correct, that's all. This is why java.nio does lots of checks in their Buffer methods. EDIT You also have to copy offset, length and the actual byte[] reference to a local variable at the beginning and before the bounds checks (because otherwise another thread could change the public non-final fields in BytesRef and cause SIGSEGV). BytesRef is a user-visible class so it must be 100% safe against all usage-violations. Based on this additional overhead, the whole comparator makes no sense except for terms with a size of 200 bytes. But Lucene terms are in 99% of all cases shorter. If you want to use this comparator, just subclass Lucene40Codec and return it as term comparator, this can be completely outside Lucene. You can even use Guava.
        Hide
        Simon Willnauer added a comment -

        can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?

        Show
        Simon Willnauer added a comment - can we have some -DXX:LuceneUseUnsafe option to enable this. I mean there are two camps here and that could make everybody happy? I mean if you use this option you have to expect possible problems no?
        Hide
        Robert Muir added a comment -

        Here's an example, since so much of the lucene codebase has bugs with bytesref offsets, i figure its a good example:

          public void testOops() {
            BytesRef b = new BytesRef("abcdefghijklmnop");
            b.offset = -545454544; // some bug, integer overflows and goes negative or other problem
            System.out.println(b.compareTo(new BytesRef("abcdefghijklmnop")));
          }
        

        With this patch, this gives me a SIGSEGV:

        junit-sequential:
            [junit] Testsuite: org.apache.lucene.util.TestBytesRef
            [junit] #
            [junit] # A fatal error has been detected by the Java Runtime Environment:
            [junit] #
            [junit] #  SIGSEGV (0xb) at pc=0x00007f386e7dcf64, pid=6093, tid=139880338200320
            [junit] #
            [junit] # JRE version: 6.0_24-b07
            [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops)
            [junit] # Problematic frame:
            [junit] # V  [libjvm.so+0x76ef64]
            [junit] #
        
        Show
        Robert Muir added a comment - Here's an example, since so much of the lucene codebase has bugs with bytesref offsets, i figure its a good example: public void testOops() { BytesRef b = new BytesRef("abcdefghijklmnop"); b.offset = -545454544; // some bug, integer overflows and goes negative or other problem System.out.println(b.compareTo(new BytesRef("abcdefghijklmnop"))); } With this patch, this gives me a SIGSEGV: junit-sequential: [junit] Testsuite: org.apache.lucene.util.TestBytesRef [junit] # [junit] # A fatal error has been detected by the Java Runtime Environment: [junit] # [junit] # SIGSEGV (0xb) at pc=0x00007f386e7dcf64, pid=6093, tid=139880338200320 [junit] # [junit] # JRE version: 6.0_24-b07 [junit] # Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops) [junit] # Problematic frame: [junit] # V [libjvm.so+0x76ef64] [junit] #
        Hide
        Robert Muir added a comment -

        the reason I am -1, I don't want JVM crashes.

        This is lucene java, users can expect not to have JVM crashes because of bytesref
        bugs in lucene (this class is used all over the place), they shoudl get AIOOBE and NPE
        and other things.

        So all is not fine just because it has a fallback.

        Convincing that there is performance win is a waste of time, this method is not a hotspot.
        Convincing me that nobody will get jvm crashes is going to be difficult.

        Show
        Robert Muir added a comment - the reason I am -1, I don't want JVM crashes. This is lucene java, users can expect not to have JVM crashes because of bytesref bugs in lucene (this class is used all over the place), they shoudl get AIOOBE and NPE and other things. So all is not fine just because it has a fallback. Convincing that there is performance win is a waste of time, this method is not a hotspot. Convincing me that nobody will get jvm crashes is going to be difficult.
        Hide
        Uwe Schindler added a comment -

        I agree here, but before doing this, I want some non-micro-benchmarks to show the effect. If there is no real effect, don't do it. Inside Lucene the comparator is not so often used (mostly only in indexer/BytesRefHash) and in TermRangeQuery. The other use cases are asserts all over the place, but they don't count.

        I would agree to the patch if the class would be renamed to something like UnsignedBytesComparator and the part importing sun.misc.Unsafe to be outside the main compilation unit. So if somebody compiles with a strange JVM like Harmony (although its dead) and sun.misc.Unsafe is not available, the build succeeds. The code in BytesRef is using reflection to load the comparator implementation, so all is fine, it would just get ClassNotFoundEx and fallback to the Java one. I could help in doing the ANT magic.

        Show
        Uwe Schindler added a comment - I agree here, but before doing this, I want some non-micro-benchmarks to show the effect. If there is no real effect, don't do it. Inside Lucene the comparator is not so often used (mostly only in indexer/BytesRefHash) and in TermRangeQuery. The other use cases are asserts all over the place, but they don't count. I would agree to the patch if the class would be renamed to something like UnsignedBytesComparator and the part importing sun.misc.Unsafe to be outside the main compilation unit. So if somebody compiles with a strange JVM like Harmony (although its dead) and sun.misc.Unsafe is not available, the build succeeds. The code in BytesRef is using reflection to load the comparator implementation, so all is fine, it would just get ClassNotFoundEx and fallback to the Java one. I could help in doing the ANT magic.
        Hide
        Dawid Weiss added a comment -

        There's been an interesting discussion about the common use of Unsafe on hotspot mailing list recently (can't recall the thread now though). Some people even wanted Unsafe to become part of standard library (not unsafe accesses – the lock checking part, but nonetheless). This guy wrote an entire off-heap collections library on top of Unsafe:

        http://www.ohloh.net/p/java-huge-collections

        I think using Unsafe with a fallback is fine, especially in small-scope methods that are used frequently and can be thoroughly tested. BytesRef is such an example to me.

        This said, it would certainly help to convince Robert and others if you ran benchmarks alongside with and without Unsafe and show how much there is to gain, Shay.

        Show
        Dawid Weiss added a comment - There's been an interesting discussion about the common use of Unsafe on hotspot mailing list recently (can't recall the thread now though). Some people even wanted Unsafe to become part of standard library (not unsafe accesses – the lock checking part, but nonetheless). This guy wrote an entire off-heap collections library on top of Unsafe: http://www.ohloh.net/p/java-huge-collections I think using Unsafe with a fallback is fine, especially in small-scope methods that are used frequently and can be thoroughly tested. BytesRef is such an example to me. This said, it would certainly help to convince Robert and others if you ran benchmarks alongside with and without Unsafe and show how much there is to gain, Shay.
        Hide
        Jason Rutherglen added a comment -

        +1 There are 3 other MAJOR Apache projects that have already integrated this efficiency. It's completely silly not to use it.

        Show
        Jason Rutherglen added a comment - +1 There are 3 other MAJOR Apache projects that have already integrated this efficiency. It's completely silly not to use it.
        Hide
        Shay Banon added a comment -

        ok, so Mr Muir is -1, if people/devs still want to see it in Lucene, tell me and I can try and work on making it more Committable based on Uwe comments.

        Show
        Shay Banon added a comment - ok, so Mr Muir is -1, if people/devs still want to see it in Lucene, tell me and I can try and work on making it more Committable based on Uwe comments.
        Hide
        Robert Muir added a comment -

        Its not like mmapdirectory at all, there we are basically doing what
        we must as a workaround to one of the top bugs at bugs.sun.com to reduce
        number of open files and transient disk usage. It seems like a reasonable
        calculated risk, but its still terrible.

        I know you want to discuss this, but this is so ridiculous that I'm not
        even going to have the conversation. I'm not going to change my vote here.

        Show
        Robert Muir added a comment - Its not like mmapdirectory at all, there we are basically doing what we must as a workaround to one of the top bugs at bugs.sun.com to reduce number of open files and transient disk usage. It seems like a reasonable calculated risk, but its still terrible. I know you want to discuss this, but this is so ridiculous that I'm not even going to have the conversation. I'm not going to change my vote here.
        Hide
        Uwe Schindler added a comment -

        What do we have to lose, it's like with MMapDirectory, a few tests on startup and setting a reference to the improved comparator and done? Indeed we need compile-time checking (import of the Unsafe class), but if there are incompatible changes we will see on runtime linking and automatically fall back to the Java-Only comparator.

        Show
        Uwe Schindler added a comment - What do we have to lose, it's like with MMapDirectory, a few tests on startup and setting a reference to the improved comparator and done? Indeed we need compile-time checking (import of the Unsafe class), but if there are incompatible changes we will see on runtime linking and automatically fall back to the Java-Only comparator.
        Hide
        Robert Muir added a comment -

        where is the hotspot that requires this?

        If there is many comparisons being performed somewhere, we should use a better datastructure instead.

        My vote: -1

        We have everything to lose and little to gain.

        Show
        Robert Muir added a comment - where is the hotspot that requires this? If there is many comparisons being performed somewhere, we should use a better datastructure instead. My vote: -1 We have everything to lose and little to gain.
        Hide
        Uwe Schindler added a comment -

        Nice, I remember Jason brought this up!

        My problem is only: sun.misc.Unsafe is undocumented, so it might not be available on compile time. Can we handle that somehow without reflection, maybe in a separate file where we dont fail if the javac ant task fails? I know, Unsafe is always available, but I dont want to rely on it - maybe someone wants to compile our code with SomeStrangeJDK™.
        The other thing is the inner class holder. Can we rename it to not contain "lexicographical", maybe UnsignedByteComparatorHolder? Lucene works on byte[] that can be anything, even binary terms.

        Show
        Uwe Schindler added a comment - Nice, I remember Jason brought this up! My problem is only: sun.misc.Unsafe is undocumented, so it might not be available on compile time. Can we handle that somehow without reflection, maybe in a separate file where we dont fail if the javac ant task fails? I know, Unsafe is always available, but I dont want to rely on it - maybe someone wants to compile our code with SomeStrangeJDK™. The other thing is the inner class holder. Can we rename it to not contain "lexicographical", maybe UnsignedByteComparatorHolder? Lucene works on byte[] that can be anything, even binary terms.
        Hide
        Jason Rutherglen added a comment -

        Nice, I mentioned this on the dev list over a month ago (originally it was mentioned on the Hadoop list), nice to see it get into Lucene, though am curious where the speed improvement will be for Lucene.

        Show
        Jason Rutherglen added a comment - Nice, I mentioned this on the dev list over a month ago (originally it was mentioned on the Hadoop list), nice to see it get into Lucene, though am curious where the speed improvement will be for Lucene.
        Hide
        Shay Banon added a comment -

        Cool.

        p.s. I formatted it using the IDEA formatting template provided in Lucene

        Show
        Shay Banon added a comment - Cool. p.s. I formatted it using the IDEA formatting template provided in Lucene
        Hide
        Simon Willnauer added a comment -

        Shay, thanks for doing this. Somebody mentioned this on the list earlier. I will take your patch and remove all the formatting if you don't mind. I think this makes lots of sense especially if you have really long references.

        Show
        Simon Willnauer added a comment - Shay, thanks for doing this. Somebody mentioned this on the list earlier. I will take your patch and remove all the formatting if you don't mind. I think this makes lots of sense especially if you have really long references.
        Hide
        Shay Banon added a comment -

        Patch attached, all tests seem to pass.

        Show
        Shay Banon added a comment - Patch attached, all tests seem to pass.

          People

          • Assignee:
            Unassigned
            Reporter:
            Shay Banon
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development