Lucene - Core
LUCENE-4889

UnicodeUtil.codePointCount microbenchmarks (wtf)

    Details

    • Type: Task
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 5.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is interesting. I posted a link to a state-machine-based UTF8 parser/recognizer:
      http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

      I spent some time thinking about whether the lookup table could be converted into a stateless computational function, which would avoid the table lookup (in Java, a lookup implies an additional bounds check that I think will be hard to eliminate). This didn't turn out to be easy: it boils down to finding a simple function that maps a set of integers onto a concrete permutation of itself, a generalization of minimal perfect hashing.
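For reference, the table-driven approach amounts to indexing a per-leading-byte length table and hopping over sequences. A minimal sketch of that idea, assuming well-formed UTF-8 (hypothetical class and method names, not Lucene's actual UnicodeUtil code):

```java
public class TableCount {
    // Leading-byte length table, one entry per possible first byte:
    // 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
    // Continuation bytes (10xxxxxx) stay 0; they are invalid as leaders.
    static final int[] LEN = new int[256];
    static {
        for (int b = 0; b < 0x80; b++) LEN[b] = 1;
        for (int b = 0xC0; b < 0xE0; b++) LEN[b] = 2;
        for (int b = 0xE0; b < 0xF0; b++) LEN[b] = 3;
        for (int b = 0xF0; b < 0xF8; b++) LEN[b] = 4;
    }

    // Counts codepoints by skipping over each sequence via the table.
    // Assumes valid UTF-8, like UnicodeUtil does.
    static int codePointCount(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; count++) {
            i += LEN[utf8[i] & 0xFF];
        }
        return count;
    }
}
```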

      But out of curiosity I thought it'd be fun to compare how Lucene's codepoint counting fares against Java's built-in decoder and a plain sequence of if's.

      I've put together a Caliper benchmark that processes 50 million codepoints over two data sets: one ASCII-only, one mixed Unicode. The results are interesting. On my Windows/i7:

       implementation dataType          ns linear runtime
               LUCENE  UNICODE 167359502.6 ===============
               LUCENE    ASCII 334015746.5 ==============================
      NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
      NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
          NOLOOKUP_IF  UNICODE  90149072.6 ========
          NOLOOKUP_IF    ASCII  29151411.4 ==
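For reference, the NOLOOKUP_IF variant boils down to classifying the leading byte with plain comparisons instead of a table lookup. A minimal sketch, assuming well-formed input (hypothetical names; the real benchmark code is in the linked repository):

```java
public class IfCount {
    // Branch-based codepoint counter (the NOLOOKUP_IF idea): determine the
    // sequence length from the leading byte with comparisons only.
    // Assumes valid UTF-8, so a leading byte is never a continuation byte.
    static int codePointCount(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; count++) {
            int b = utf8[i] & 0xFF;
            if (b < 0x80)      i += 1; // ASCII
            else if (b < 0xE0) i += 2; // two-byte sequence (110xxxxx)
            else if (b < 0xF0) i += 3; // three-byte sequence (1110xxxx)
            else               i += 4; // four-byte sequence (11110xxx)
        }
        return count;
    }
}
```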
      

      Disregard the switch lookup – it's there for fun only. But a sequence of if's is significantly faster than Lucene's current table lookup, especially on ASCII input. And now compare this to Java's built-in decoder...

                 JAVA  UNICODE   5753930.1 =
                 JAVA    ASCII        23.8 =
      

      Yes, it's the same benchmark. Wtf? I realize buffers are partially native and probably so is the UTF-8 decoder, but by this much?! Again, to put it in context:

       implementation dataType          ns linear runtime
               LUCENE  UNICODE 167359502.6 ===============
               LUCENE    ASCII 334015746.5 ==============================
                 JAVA  UNICODE   5753930.1 =
                 JAVA    ASCII        23.8 =
          NOLOOKUP_IF  UNICODE  90149072.6 ========
          NOLOOKUP_IF    ASCII  29151411.4 ==
      NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
      NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
      

      Wtf? The code is here if you want to experiment.
      https://github.com/dweiss/utf8dfa

      I realize the Java version needs to allocate a temporary buffer, but if these numbers hold across different VMs it may actually be worth it...
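For context, counting via the built-in decoder can be sketched like this (an illustration of the approach, not the benchmark's actual code): decode everything into a scratch CharBuffer, then let Character.codePointCount account for surrogate pairs.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecoderCount {
    // Counts codepoints by fully decoding into a temporary char buffer and
    // then counting surrogate pairs. The scratch buffer is sized for the
    // worst case (ASCII: one char per input byte).
    static int codePointCount(byte[] utf8) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(utf8.length);
        dec.decode(ByteBuffer.wrap(utf8), out, true);
        dec.flush(out);
        out.flip();
        return Character.codePointCount(out, 0, out.length());
    }
}
```

This is the shape of the approach that needs the temporary buffer the description mentions.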

        Activity

        Uwe Schindler added a comment -

        Closed after release.

        Dawid Weiss added a comment -

        http://i.stack.imgur.com/jiFfM.jpg [Double facepalm]

        It couldn't be right, it was surreal. I checked the code again and indeed, there was a subtle bug in the Java code – it was looping over the decode loop all right, but it never rewound the input buffer after the first pass, damn it. Corrected, it shows sensible output:

        implementation dataType    ms linear runtime
        [current lucene]
                LUCENE  UNICODE 167.3 ==========
                LUCENE    ASCII 333.9 =====================
        [patch]
           LUCENE_MOD1  UNICODE 103.0 ======
           LUCENE_MOD1    ASCII  77.2 ====
        [if-based version but without assertions]
           NOLOOKUP_IF  UNICODE  90.2 =====
           NOLOOKUP_IF    ASCII  29.1 =
        [java version]
                  JAVA  UNICODE 465.6 ==============================
                  JAVA    ASCII 103.1 ======
        [no branching/counting, just pass over data]
              NO_COUNT  UNICODE  52.1 ===
              NO_COUNT    ASCII  26.0 =
        

        I also did a plain non-benchmark loop in which each variant simply counts everything 10 times.

        time          codepoints   version data set
             1.676 <= [ 500000010] UNICODE LUCENE
             0.905 <= [ 500000010] UNICODE LUCENE_MOD1
             0.905 <= [ 500000010] UNICODE NOLOOKUP_IF
             4.686 <= [ 500000010] UNICODE JAVA
        
             3.339 <= [1000000000] ASCII LUCENE
             1.028 <= [1000000000] ASCII LUCENE_MOD1
             1.027 <= [1000000000] ASCII NOLOOKUP_IF
             1.591 <= [1000000000] ASCII JAVA
        

        I'll commit the patch since it improves both the speed and the internal validation logic.
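The bug class is easy to reproduce: a ByteBuffer consumed by a decoder is left positioned at its limit, so every later pass decodes zero bytes unless the buffer is rewound. A minimal illustration (hypothetical names, not the benchmark's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class RewindBug {
    // Decodes the same input 'reps' times, as a benchmark loop would.
    // With rewind=false the input buffer stays exhausted after the first
    // pass, so later passes decode nothing -- the bug described above.
    static int totalCharsDecoded(byte[] utf8, int reps, boolean rewind) {
        ByteBuffer in = ByteBuffer.wrap(utf8);
        CharBuffer out = CharBuffer.allocate(utf8.length);
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        int total = 0;
        for (int r = 0; r < reps; r++) {
            if (rewind) in.rewind(); // without this, passes 2..n do no work
            out.clear();
            dec.reset();
            dec.decode(in, out, true);
            total += out.position();
        }
        return total;
    }
}
```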

        Dawid Weiss added a comment -

        A patch cleaning up codePointCount. The state array is still used in UTF8toUTF32, so I left it in but replaced the -1's with Integer.MIN_VALUE (not that it makes a difference here).

        Michael McCandless added a comment -

        I'll actually prepare a patch that replaces the current implementation of UnicodeUtil.codePointCount because it can spin forever on invalid input (yes, it does say the behavior is undefined, but I think we should just throw an exception on invalid input).

        +1 ... spinning forever on bad input is not nice!
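A validating counter along those lines might look like the following sketch: it throws on an invalid leading byte or a truncated trailing sequence instead of spinning (hypothetical names, not the committed patch):

```java
public class ValidatingCount {
    // Counts codepoints but refuses malformed input: an invalid leading
    // byte (continuation byte, overlong 0xC0/0xC1, or > 0xF4) or a
    // sequence running past the end of the array raises an exception
    // rather than looping forever or reading garbage.
    static int codePointCount(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; count++) {
            int b = utf8[i] & 0xFF;
            int len;
            if (b < 0x80)                   len = 1;
            else if (b >= 0xC2 && b < 0xE0) len = 2;
            else if (b >= 0xE0 && b < 0xF0) len = 3;
            else if (b >= 0xF0 && b < 0xF5) len = 4;
            else throw new IllegalArgumentException("invalid leading byte at " + i);
            if (i + len > utf8.length) {
                throw new IllegalArgumentException("truncated sequence at " + i);
            }
            i += len;
        }
        return count;
    }
}
```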

        Dawid Weiss added a comment -

        I'll actually prepare a patch that replaces the current implementation of UnicodeUtil.codePointCount because it can spin forever on invalid input (yes, it does say the behavior is undefined, but I think we should just throw an exception on invalid input).

        Mike can run his magic perf scripts and we'll see if it breaks anything. I doubt it.

        Robert Muir added a comment -

        I think we should change the counting function... it's a little scary in that I think it will loop forever on bad input?

        We could at least fix the table to not use -1 but a large negative value that stands a chance of triggering an AIOOBE.

        Dawid Weiss added a comment -

        Just pushed a version that doesn't do a table lookup for ASCII.

        implementation dataType          ns linear runtime
                LUCENE  UNICODE 167374240.6 ===============
                LUCENE    ASCII 333944799.0 ==============================
           LUCENE_MOD1  UNICODE 167449028.1 ===============
           LUCENE_MOD1    ASCII  77172139.4 ======
                  JAVA  UNICODE   5755140.1 =
                  JAVA    ASCII        23.9 =
           NOLOOKUP_IF  UNICODE  90220440.6 ========
           NOLOOKUP_IF    ASCII  29145155.3 ==
        

        That cuts the time by 75% but it's still far above the Java decoder. So it's not just the single loop, I think, Uwe.

        Also: I'm not comparing full decoding, only codepoint counting. I also assumed valid UTF-8 (since that's what UnicodeUtil assumes anyway). Finally: I'm not advocating changing it, I'm just saying it's interesting by how much these timings differ.
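The ASCII shortcut being discussed boils down to testing the sign bit before touching the table. A sketch of the idea (hypothetical names, not the committed LUCENE_MOD1 code):

```java
public class AsciiFastPath {
    // Same length table as the plain table-based counter.
    static final int[] LEN = new int[256];
    static {
        for (int b = 0; b < 0x80; b++) LEN[b] = 1;
        for (int b = 0xC0; b < 0xE0; b++) LEN[b] = 2;
        for (int b = 0xE0; b < 0xF0; b++) LEN[b] = 3;
        for (int b = 0xF0; b < 0xF8; b++) LEN[b] = 4;
    }

    // Like the table-based counter, but ASCII bytes (sign bit clear, i.e.
    // the signed byte is >= 0) never touch the table: on ASCII-heavy
    // input this keeps the array access and its bounds check off the hot
    // path. Assumes valid UTF-8.
    static int codePointCount(byte[] utf8) {
        int count = 0;
        int i = 0;
        while (i < utf8.length) {
            if (utf8[i] >= 0) {
                i++;                    // ASCII fast path, no lookup
            } else {
                i += LEN[utf8[i] & 0xFF]; // multi-byte sequence
            }
            count++;
        }
        return count;
    }
}
```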

        Robert Muir added a comment -

        But the JDK decoder is buggy. It's easy to make a buggy fast decoder.

        Uwe Schindler added a comment -

        I think the ASCII-only performance is caused by lines 211/212 and 632/633 in http://www.docjar.com/html/api/sun/nio/cs/UTF_8.java.html
        If your buffer is ASCII-only, it has a very simple loop that just copies the bytes. This may explain the speed, together with the bounds checks HotSpot eliminates on such a simple loop.

        Dawid Weiss added a comment -

        Just to be complete: I didn't inspect JIT assembly dumps and didn't check for dead-code elimination (although I don't think it should happen here, given how Caliper is written).


  People

  • Assignee: Dawid Weiss
  • Reporter: Dawid Weiss