Lucene - Core
LUCENE-4889

UnicodeUtil.codePointCount microbenchmarks (wtf)

    Details

    • Type: Task
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, Trunk
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      This is interesting. I posted a link to a state-machine-based UTF8 parser/recognizer:
      http://bjoern.hoehrmann.de/utf-8/decoder/dfa/
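      The core of that decoder is a two-table DFA step; in Java it would look roughly like this (a paraphrase of the C code on the linked page, with utf8d assumed to be a copy of its table):

        // Paraphrased (Java) DFA step from the linked decoder: one table maps a
        // byte to a character class, the other maps (state, class) to the next
        // state. States are pre-multiplied in the table; 0 = accept, 12 = reject.
        static int next(int[] utf8d, int state, int b) {
          int type = utf8d[b & 0xFF];       // character class of this byte
          return utf8d[256 + state + type]; // DFA transition
        }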

      I spent some time thinking about whether the lookup table could be converted into a stateless computational function, which would avoid a table lookup (in Java a lookup causes an additional bounds check that will, I think, be hard to eliminate). This didn't turn out to be easy: it boils down to finding a simple function that maps a set of integers to a concrete permutation of itself, a generalization of minimal perfect hashing.

      But out of curiosity I thought it'd be fun to compare how Lucene's codepoint counting fares against Java's built-in one (Decoder) and against a plain sequence of if's.

      I've put together a Caliper benchmark that processes 50 million Unicode codepoints: one data set is ASCII-only, the other mixed Unicode. The results are interesting. On my Windows i7:

       implementation dataType          ns linear runtime
               LUCENE  UNICODE 167359502.6 ===============
               LUCENE    ASCII 334015746.5 ==============================
      NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
      NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
          NOLOOKUP_IF  UNICODE  90149072.6 ========
          NOLOOKUP_IF    ASCII  29151411.4 ==
      

      Disregard the switch lookup – it's for fun only. But a sequence of if's is significantly faster than Lucene's current table lookup, especially on ASCII input. And now compare this to Java's built-in decoder...

                 JAVA  UNICODE   5753930.1 =
                 JAVA    ASCII        23.8 =
      

      Yes, it's the same benchmark. Wtf? I realize buffers are partially native and probably so is the UTF-8 decoder, but by so much?! Again, to put it in context:

       implementation dataType          ns linear runtime
               LUCENE  UNICODE 167359502.6 ===============
               LUCENE    ASCII 334015746.5 ==============================
                 JAVA  UNICODE   5753930.1 =
                 JAVA    ASCII        23.8 =
          NOLOOKUP_IF  UNICODE  90149072.6 ========
          NOLOOKUP_IF    ASCII  29151411.4 ==
      NOLOOKUP_SWITCH  UNICODE 154294141.8 =============
      NOLOOKUP_SWITCH    ASCII 119500892.8 ==========
      

      Wtf? The code is here if you want to experiment:
      https://github.com/dweiss/utf8dfa

      I realize the Java version needs to allocate a temporary buffer, but if these numbers hold across different VMs it may actually be worth it...
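      For reference, the if-based variant (NOLOOKUP_IF) boils down to classifying each lead byte with plain comparisons; a minimal sketch, assuming valid UTF-8 input (the method name is made up; the actual benchmarked code is in the repository above):

        // Hypothetical if-based codepoint counter; assumes valid UTF-8, as
        // UnicodeUtil does. The byte length of each codepoint is derived from
        // its lead byte with comparisons instead of a table lookup.
        static int codePointCountIf(byte[] utf8, int offset, int length) {
          int count = 0;
          for (int i = offset, end = offset + length; i < end; count++) {
            int b = utf8[i] & 0xFF;
            if (b < 0x80)      i += 1; // 0xxxxxxx: ASCII
            else if (b < 0xE0) i += 2; // 110xxxxx: 2-byte sequence
            else if (b < 0xF0) i += 3; // 1110xxxx: 3-byte sequence
            else               i += 4; // 11110xxx: 4-byte sequence
          }
          return count;
        }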

        Activity

        Transition          Time In Source Status  Execution Times  Last Executer  Last Execution Date
        Open → Resolved     22h 14m                1                Dawid Weiss    28/Mar/13 08:45
        Resolved → Closed   43d 1h 48m             1                Uwe Schindler  10/May/13 11:33
        Uwe Schindler added a comment -

        Closed after release.

        Dawid Weiss added a comment -

        http://i.stack.imgur.com/jiFfM.jpg [Double facepalm]

        It couldn't be right, it was surreal. I checked the code again and indeed, there was a subtle bug in the Java code – it was looping over the decode loop, all right, but it was never rewinding the input buffer after the first iteration, damn it. Corrected, it shows sensible output:

        implementation dataType    ms linear runtime
        [current lucene]
                LUCENE  UNICODE 167.3 ==========
                LUCENE    ASCII 333.9 =====================
        [patch]
           LUCENE_MOD1  UNICODE 103.0 ======
           LUCENE_MOD1    ASCII  77.2 ====
        [if-based version but without assertions]
           NOLOOKUP_IF  UNICODE  90.2 =====
           NOLOOKUP_IF    ASCII  29.1 =
        [java version]
                  JAVA  UNICODE 465.6 ==============================
                  JAVA    ASCII 103.1 ======
        [no branching/counting, just pass over data]
              NO_COUNT  UNICODE  52.1 ===
              NO_COUNT    ASCII  26.0 =
        

        I also did a non-benchmark loop in which everything is just counted 10 times.

        time          codepoints   data set version
             1.676 <= [ 500000010] UNICODE LUCENE
             0.905 <= [ 500000010] UNICODE LUCENE_MOD1
             0.905 <= [ 500000010] UNICODE NOLOOKUP_IF
             4.686 <= [ 500000010] UNICODE JAVA
        
             3.339 <= [1000000000] ASCII LUCENE
             1.028 <= [1000000000] ASCII LUCENE_MOD1
             1.027 <= [1000000000] ASCII NOLOOKUP_IF
             1.591 <= [1000000000] ASCII JAVA
        

        I'll commit the patch since it improves both the speed and the internal validation logic.
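        For illustration, the corrected loop is roughly the following (a hedged sketch rather than the actual benchmark code; utf8Bytes and reps are stand-ins):

          import java.nio.ByteBuffer;
          import java.nio.CharBuffer;
          import java.nio.charset.CharsetDecoder;
          import java.nio.charset.StandardCharsets;

          // Hypothetical reconstruction of the fix: a ByteBuffer keeps its
          // position after a decode pass, so without rewind() every pass after
          // the first saw an exhausted buffer and finished instantly.
          static void decodeRepeatedly(byte[] utf8Bytes, int reps) {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
            ByteBuffer in = ByteBuffer.wrap(utf8Bytes);
            // UTF-8 never produces more chars than input bytes, so this suffices.
            CharBuffer out = CharBuffer.allocate(utf8Bytes.length);
            for (int rep = 0; rep < reps; rep++) {
              in.rewind();     // the missing call: reset the input for each pass
              out.clear();
              decoder.reset();
              decoder.decode(in, out, true);
              decoder.flush(out);
            }
          }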

        Dawid Weiss added a comment -

        A patch cleaning up codePointCount. The state array is still used in UTF8toUTF32 so I left it in, but replaced -1's with Integer.MIN_VALUE (not that it makes a difference here).

        Michael McCandless added a comment -

        I'll actually prepare a patch that replaces the current implementation of UnicodeUtil.codePointCount because it can spin forever on invalid input (yes it does say the behavior is undefined but I think we should just throw an exception on invalid input).

        +1 ... spinning forever on bad input is not nice!

        Dawid Weiss added a comment -

        I'll actually prepare a patch that replaces the current implementation of UnicodeUtil.codePointCount because it can spin forever on invalid input (yes it does say the behavior is undefined but I think we should just throw an exception on invalid input).

        Mike can run his magic perf. scripts and we'll see if it breaks anything. I doubt it will.
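        For illustration, a fail-fast counter might look like the sketch below (a hedged reconstruction of the idea, not the actual patch; only lead bytes are validated here):

          // Hypothetical sketch: codepoint counting that throws on a malformed
          // lead byte instead of spinning forever. Continuation bytes are not
          // checked here; the real patch may validate more.
          static int codePointCountValidating(byte[] utf8, int offset, int length) {
            int count = 0;
            for (int i = offset, end = offset + length; i < end; count++) {
              int b = utf8[i] & 0xFF;
              int len;
              if (b < 0x80)      len = 1;
              else if (b < 0xC2) len = -1; // continuation byte or overlong lead
              else if (b < 0xE0) len = 2;
              else if (b < 0xF0) len = 3;
              else if (b < 0xF5) len = 4;
              else               len = -1; // 0xF5..0xFF never occur in UTF-8
              if (len < 0) {
                throw new IllegalArgumentException(
                    "Invalid UTF-8 lead byte 0x" + Integer.toHexString(b) + " at " + i);
              }
              i += len;
            }
            return count;
          }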

        Robert Muir added a comment -

        I think we should change the counting function... it's a little scary in that I think it will loop forever on bad input?

        We could at least fix the table to not use -1 but a large negative value that stands a chance of triggering an AIOOBE (ArrayIndexOutOfBoundsException).
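        The way I read that (an illustrative sketch with an assumed DFA-style lookup, not the actual UnicodeUtil table): an index derived from -1 can land back inside the array and keep the loop spinning, while Integer.MIN_VALUE pushes any subsequent index far out of bounds:

          static final int INVALID = Integer.MIN_VALUE; // instead of -1

          static int nextState(int[] table, int state, int byteClass) {
            // If state is INVALID, this access is guaranteed to throw an
            // ArrayIndexOutOfBoundsException instead of silently continuing.
            return table[state + byteClass];
          }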

        Dawid Weiss added a comment -

        Just pushed a version that doesn't do a table lookup for ASCII.

        implementation dataType          ns linear runtime
                LUCENE  UNICODE 167374240.6 ===============
                LUCENE    ASCII 333944799.0 ==============================
           LUCENE_MOD1  UNICODE 167449028.1 ===============
           LUCENE_MOD1    ASCII  77172139.4 ======
                  JAVA  UNICODE   5755140.1 =
                  JAVA    ASCII        23.9 =
           NOLOOKUP_IF  UNICODE  90220440.6 ========
           NOLOOKUP_IF    ASCII  29145155.3 ==
        

        Cuts the time by 75%, but it's still far above the Java decoder. So it's not just that single simple loop, I think, Uwe.

        Also: I'm not comparing full decoding, only codepoint counting. I also assumed valid UTF-8 (since that's what UnicodeUtil does anyway). Finally: I'm not advocating for changing it; I'm just saying it's interesting by how much these timings differ.
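        The shape of that change, roughly (a hypothetical sketch; the real LUCENE_MOD1 is in the utf8dfa repo, and utf8CodeLength here stands in for Lucene's lead-byte length table):

          // ASCII bytes (high bit clear) are counted directly; only multi-byte
          // lead bytes go through the lookup table.
          static int codePointCountMod1(byte[] utf8, int offset, int length,
                                        int[] utf8CodeLength) {
            int count = 0;
            for (int i = offset, end = offset + length; i < end; count++) {
              byte b = utf8[i];
              if (b >= 0) {
                i += 1;                        // ASCII fast path: no table access
              } else {
                i += utf8CodeLength[b & 0xFF]; // table lookup for non-ASCII leads
              }
            }
            return count;
          }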

        Robert Muir added a comment -

        But the JDK decoder is buggy. It's easy to make a buggy fast decoder.

        Uwe Schindler added a comment -

        I think the ASCII-only performance is caused by lines 211/212 and 632/633 in http://www.docjar.com/html/api/sun/nio/cs/UTF_8.java.html
        If your buffer is only ASCII, it has a very simple loop that simply copies the bytes. This may explain the speed, together with bounds checks eliminated by HotSpot on this simple loop.
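        That fast path is roughly of this shape (a paraphrased sketch, not the actual sun.nio.cs.UTF_8 source):

          // While bytes have the high bit clear they are copied straight through
          // as chars; HotSpot can optimize this trivial loop aggressively
          // (bounds-check elimination, unrolling), which helps explain the numbers.
          static int asciiFastPath(byte[] src, int srcPos, int srcEnd,
                                   char[] dst, int dstPos) {
            while (srcPos < srcEnd && src[srcPos] >= 0) {
              dst[dstPos++] = (char) src[srcPos++];
            }
            return srcPos; // position of the first non-ASCII byte (or srcEnd)
          }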

        Dawid Weiss added a comment -

        Just to complete: I didn't inspect JIT assembly dumps and didn't check for dead code elimination (although I don't think it should happen here, given how Caliper is written).

        Dawid Weiss created issue

          People

          • Assignee: Dawid Weiss
          • Reporter: Dawid Weiss
          • Votes: 0
          • Watchers: 2
