Lucene - Core

LUCENE-2265: improve automaton performance by running on byte[]

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Currently, when enumerating terms, the automaton must first convert each entire term from flex's native UTF-8 byte[] to char[], then step each char through the state machine.

      We can make this more efficient by allowing the state machine to run directly on byte[], so it can return true/false faster.
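      As a rough illustration of the intended change (a minimal standalone sketch; the class name and table layout are invented here and are not Lucene's API), a table-compiled DFA can step one byte at a time over the term's UTF-8 bytes, so no per-term byte[] -> char[] decode is needed:

```java
// Minimal sketch, not Lucene code: a table-compiled DFA run directly over
// the UTF-8 bytes of a term, avoiding a per-term byte[] -> char[] decode.
final class ByteDfaSketch {
    private final int[][] transitions; // transitions[state][byteValue] -> next state, or -1
    private final boolean[] accept;    // accept[state] -> is this an accept state?

    ByteDfaSketch(int[][] transitions, boolean[] accept) {
        this.transitions = transitions;
        this.accept = accept;
    }

    /** Returns true if the DFA accepts the given UTF-8 encoded term. */
    boolean run(byte[] term, int offset, int length) {
        int state = 0; // assume state 0 is the initial state
        for (int i = offset; i < offset + length; i++) {
            state = transitions[state][term[i] & 0xFF];
            if (state == -1) {
                return false; // dead end: reject without ever decoding to char[]
            }
        }
        return accept[state];
    }
}
```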

      Attachments

      1. LUCENE-2265_pare.patch
        12 kB
        Robert Muir
      2. LUCENE-2265_utf32.patch
        27 kB
        Robert Muir
      3. LUCENE-2265.patch
        205 kB
        Michael McCandless
      4. LUCENE-2265.patch
        194 kB
        Michael McCandless
      5. LUCENE-2265.patch
        194 kB
        Michael McCandless
      6. LUCENE-2265.patch
        90 kB
        Michael McCandless
      7. LUCENE-2265.patch
        90 kB
        Michael McCandless
      8. LUCENE-2265.patch
        89 kB
        Michael McCandless
      9. LUCENE-2265.patch
        89 kB
        Robert Muir
      10. LUCENE-2265.patch
        73 kB
        Robert Muir
      11. LUCENE-2265.patch
        27 kB
        Michael McCandless
      12. LUCENE-2265.patch
        26 kB
        Michael McCandless
      13. LUCENE-2265.patch
        6 kB
        Robert Muir

          Activity

          Robert Muir added a comment -

           Attached is a patch. At most it improves performance by around 10% for Latin-1 text;
           I haven't benchmarked non-Latin text yet.

           Even if it's better, I am not sure we want to do this (is the performance worth the complexity?), but all the tests pass.

          Robert Muir added a comment -

           I tested this with Hindi text as well; it gives no benefit there. So it only helps English. I will leave the patch out here in case someone has a better idea.

          Robert Muir added a comment -

          I discussed this situation with Mike McCandless and I think we might have something of a plan.

          For reference, here is the problem:

          • In flex the terms are byte[] (typically UTF-8)
          • Automaton transitions work on UTF-16 intervals (char)
          • RunAutomaton is an array-compiled form that also works on UTF-16 (char[])
           • Because of this, we incur a lot of Unicode conversion overhead between byte[] and char[], which hurts performance.

          Here is the current idea:

          • Switch Automaton to work on UTF-32 intervals (int)
          • Create a method to convert a UTF-32 Automaton to an equivalent UTF-8 Automaton.
          • Create a UTF-8 RunAutomaton that works on byte[]
          • We could also create a UTF-32 RunAutomaton that works on codepoints, for use in analysis, etc.

           This would have some nice benefits besides performance:
           for example, a wildcard operator of "?" or regex operator of "." would match a real Unicode codepoint,
           not a single code unit like it always has. So if we can make this work, we would have better
           performance and better Unicode support.

           The trick is to do the UTF-32 DFA -> UTF-8 DFA conversion efficiently, especially keeping determinism
           and not causing a nasty explosion in the number of states.
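           To illustrate the codepoint-vs-code-unit point above (a small standalone sketch, not part of any patch; the class name is made up), a single UTF-32 interval transition [0, 0x10FFFF] matches any one character in one step, including supplementary characters that need two UTF-16 code units:

```java
// Minimal sketch (not Lucene code) of a codepoint-interval transition being
// stepped over code points rather than UTF-16 code units.
final class CodePointStepSketch {
    static final int MIN = 0, MAX = 0x10FFFF; // the "." / "?" transition as one interval

    static boolean matchesAnySingleCharacter(String s) {
        if (s.isEmpty()) return false;
        int cp = s.codePointAt(0);         // full code point, not a code unit
        int len = Character.charCount(cp); // 1 or 2 UTF-16 code units
        return len == s.length() && cp >= MIN && cp <= MAX;
    }

    public static void main(String[] args) {
        System.out.println(matchesAnySingleCharacter("a"));            // true
        System.out.println(matchesAnySingleCharacter("\uD834\uDD1E")); // true: U+1D11E is one code point
        System.out.println(matchesAnySingleCharacter("ab"));           // false: two code points
    }
}
```

           A char-based transition can only cover the BMP in a single step; supplementary characters would need explicit surrogate-pair handling.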

          Earwin Burrfoot added a comment -

          I probably missed something.

          Why can't you create UTF-8 Automaton from the get go?

          Robert Muir added a comment -

          Why can't you create UTF-8 Automaton from the get go?

           Because, at a high level, users want automaton transitions to represent real characters
           (e.g. regular expressions, wildcards, etc.) and do not much care about bytes!

           So the UTF-16 Automaton/RunAutomaton pair makes sense for the library...

           But UTF-32 is still easy to work with at a high level (we just represent codepoint intervals instead of code unit intervals),
           and UTF-8 is faster for working with Lucene.

          Earwin Burrfoot added a comment -

          So? You aren't making some generic automaton library, are you?
          Get these user regexes/wildcards/etc, convert them to utf-8, build utf-8 automaton, run it against lucene data.

          Earwin Burrfoot added a comment -

          I mean, high-level, users don't care about your automaton at all, much less transitions. They want their regexes and wildcards to work.

          Robert Muir added a comment -

          So? You aren't making some generic automaton library, are you?
          Get these user regexes/wildcards/etc, convert them to utf-8, build utf-8 automaton, run it against lucene data.

           This just pushes the complexity into the parsers. And yes, it makes sense to support high-level (char[]) operations
           with automaton too, such as analysis.

           I encourage you to take a look at the existing code. In general, a lot of parsers (see wildcard and regex) are implemented
           with primitive automata like 'makeAnyChar'. 'makeAnyByte' makes no sense.

           So it's generic in the sense that fuzzy, regex, wildcard - all of our uses - are defined on Unicode characters. High-level
           operations such as parsing, intersection, and union belong in UTF-16 or UTF-32 space, not in byte space.

           Bytes are an implementation detail, and we shouldn't operate on UTF-8 except behind the scenes.

          Robert Muir added a comment -

          I mean, high-level, users don't care about your automaton at all, much less transitions. They want their regexes and wildcards to work.

           But you have it backwards: users want their regexes and wildcards to work. They want wildcard "?" or regex "." to match Unicode
           characters, not bytes. No one cares about bytes.

          Earwin Burrfoot added a comment -

          Hmmmmm.

           I'd say that your high-level operations like intersection and union remain exactly the same regardless of the alphabet you're operating on.
           The primitive automata, like AnyChar, will have to cease being so primitive and generate a pair of states instead of one, but that's essentially it - after the primitive automata are fixed to recognize UTF-8 bytes, you don't even have to change the parsing code that much.

           The only true problem I see is that you lose the ability to operate on char[]. But then, I ask again: are you writing a generic library, or did you borrow automata code from one with the single aim of having fast Lucene queries?

          Robert Muir added a comment -

          Hmmmmm.

           I'd say that your high-level operations like intersection and union remain exactly the same regardless of the alphabet you're operating on.
           The primitive automata, like AnyChar, will have to cease being so primitive and generate a pair of states instead of one, but that's essentially it - after the primitive automata are fixed to recognize UTF-8 bytes, you don't even have to change the parsing code that much.

           The only true problem I see is that you lose the ability to operate on char[]. But then, I ask again: are you writing a generic library, or did you borrow automata code from one with the single aim of having fast Lucene queries?

           Well, this is a borrowed library, but that doesn't really matter. The trick is that UTF-16 and UTF-32 are much more efficient for the kind of processing the high-level component needs: things like NFA->DFA conversion and minimization. It's much better to do some of these quadratic algorithms on high-level Unicode instead of bytes, and get a minimal DFA. At the same time the intervals represent real things, so it's debuggable, etc.

           So to me it makes perfect sense to change the transition's min/max from 'char' to 'int', which is trivial and won't require me to rewrite all the primitive automata. Things like NFA->DFA conversion will actually be faster for some text, and never slower.

           This gives us the opportunity to 'compile' to UTF-8 or UTF-32 RunAutomata (although for the latter we cannot use the classmap trick, since the alphabet will be large). This way, it can be used effectively at both a high and a low level, and the code is easy to maintain.

           I can debug the code now with things like toString and toDot; I certainly cannot do this if the high-level code is changed to byte/UTF-8. It would be completely unmaintainable, and most likely slower overall due to doing quadratic things like determinization on exploded UTF-8 automata.
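           For readers unfamiliar with the "classmap trick" mentioned above, here is an illustrative sketch (invented names, not Lucene's RunAutomaton code): characters are grouped into equivalence classes at interval boundaries, so the compiled table is states x classes rather than states x 65536. A flat per-symbol map like this is practical for a 64K char alphabet, but not for the full 0..0x10FFFF codepoint range, which is the limitation noted above.

```java
// Illustrative sketch of a classmap-compressed transition table.
final class ClassMapSketch {
    private final int[] classmap = new int[0x10000]; // char value -> equivalence class id
    private final int[][] table;                     // table[state][classId] -> next state, or -1

    /** points = sorted start characters of each equivalence class, with points[0] == 0. */
    ClassMapSketch(char[] points, int[][] table) {
        this.table = table;
        int cls = 0;
        for (int c = 0; c < classmap.length; c++) {
            if (cls + 1 < points.length && c == points[cls + 1]) {
                cls++; // crossed into the next class interval
            }
            classmap[c] = cls;
        }
    }

    int step(int state, char c) {
        return table[state][classmap[c]];
    }

    public static void main(String[] args) {
        // Two classes: [0..'a'-1] and ['a'..0xFFFF]; one state that loops only on class 1.
        int[][] table = { { -1, 0 } };
        ClassMapSketch sketch = new ClassMapSketch(new char[] {0, 'a'}, table);
        System.out.println(sketch.step(0, 'z')); // 0  (stays in state 0)
        System.out.println(sketch.step(0, '!')); // -1 (rejected)
    }
}
```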

          Robert Muir added a comment -

           As I looked at this, I noticed some unused functionality (numeric fractions and the like).

           Attached is a patch to remove it. I plan to commit soon.

          Michael McCandless added a comment -

           Attached is a patch with a first cut at APIs to convert a UTF-32 automaton to UTF-8.

           There's a test case that verifies the conversion is working correctly. It seems to be working.

           This patch is just the low-level API, i.e., converting one edge containing a UTF-32 range. I still need to fix it to convert an entire UTF-32 DFA... should be straightforward.

           Also, I need to merge with Robert's int (UTF32) cutover and a UTF8RunAutomaton.
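           To sketch the shape of that per-edge conversion (a simplified, standalone illustration, not the patch's algorithm; it also ignores the surrogate gap), a UTF-32 codepoint range is first split at the UTF-8 encoding-length boundaries; each sub-range can then be expressed as a short chain of byte-valued transitions, since every codepoint inside it encodes to the same number of bytes:

```java
import java.nio.charset.StandardCharsets;

// Simplified sketch: split a codepoint range at UTF-8 length boundaries and
// show the UTF-8 byte sequences of each sub-range's endpoints.
final class Utf32RangeSplitSketch {
    // Code points below each boundary encode to 1, 2, 3, 4 UTF-8 bytes respectively.
    static final int[] BOUNDARIES = {0x80, 0x800, 0x10000, 0x110000};

    public static void main(String[] args) {
        int min = 0x41, max = 0x1F600; // example range spanning several UTF-8 lengths
        int lo = min;
        for (int boundary : BOUNDARIES) {
            if (lo > max) break;
            int hi = Math.min(max, boundary - 1);
            if (lo <= hi) {
                System.out.printf("sub-range U+%04X..U+%04X -> %s .. %s%n",
                        lo, hi, utf8(lo), utf8(hi));
            }
            lo = boundary;
        }
    }

    static String utf8(int cp) {
        StringBuilder sb = new StringBuilder();
        for (byte b : new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }
}
```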

          Robert Muir added a comment -

           Attached is a totally scary patch to convert the entire automaton lib to UTF-32...
           (I didn't mess with any search code, and obviously it won't even compile with this patch.)

          Michael McCandless added a comment -

           Patch with a first cut at a method to cut over a whole UTF-32 DFA -> UTF-8 DFA (and... call determinize on it).

          Robert Muir added a comment -

           This is Mike's patch + my patch + a quick hack attempt... most, but not all, tests are passing.

          Robert Muir added a comment -

           OK, I think I made some serious progress here, but I did find a bug in the UTF-32 -> UTF-8 DFA converter.
           The problem is it does not handle at least the case where the initial state is an accept state.
           I created a test case for this (TestUTF32SpecialCase), and included the Python code back, as I figure you will probably fix it there first.

           I deleted the surrogate-seeking tests; like other nuances, these won't behave the same if we switch to byte[], as those regexps
           are no longer defined.

           Remaining is to switch the slow fuzzy to use codepoint calculations (to be consistent with the fast one).
           By the way, it's really silly that we have to Unicode-convert just to get the length in chars for that score calculation... ugh!
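           As a small aside on the codepoint-calculation direction (an illustrative sketch, not the patch's code; the class and method names are invented), the number of codepoints in a UTF-8 byte[] can be computed without decoding to char[] at all, by counting the bytes that are not continuation bytes:

```java
import java.nio.charset.StandardCharsets;

// Sketch: count code points directly from UTF-8 bytes. Continuation bytes
// have the form 10xxxxxx, so every non-continuation byte starts a code point.
final class Utf8CodePointCount {
    static int codePointCount(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] bytes = "héllo\uD834\uDD1E".getBytes(StandardCharsets.UTF_8);
        System.out.println(codePointCount(bytes, 0, bytes.length)); // 6 code points
    }
}
```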

          Michael McCandless added a comment -

          The problem is it does not handle at least the case where the initial state is an accept state.

          OK this is a simple fix in the UTF32ToUTF8.convert method – I didn't set isAccept on the initial state – new patch attached that fixes this.

          Michael McCandless added a comment -

          Checkpointing progress from Robert & I on this issue...

          The conversion is now done in Java.

          Michael McCandless added a comment -

          Last patch was a bit stale – this one is current, and all tests pass.

          Robert Muir added a comment -

           So here are the advantages of the current patch:

           • Full Unicode support (Regular Expression, Wildcard, Fuzzy): for example, wildcard ? means codepoint, not code unit.
           • Support for matching all Unicode encoding forms easily (UTF-8, UTF-16, UTF-32).
           • Easy to support both native UTF-8 term sort order and the UTF-8-in-UTF-16 order we have now; this is not feasible with the existing UTF-16 representation.
           • Easy to safely do DFA operations on Automaton, because there are no surrogates anymore. For example, we can safely reverse any automaton to take advantage of Solr's leading wildcard support (e.g. support "leading" regexps, too).
           • Better compatibility with Lucene, because the automaton is in sync with the terms format (bytes). This could lead to future optimizations like TermsEnum exposing the 'shared prefix' of a term with the previously enumerated term.

           Unfortunately, there are currently a few disadvantages with the patch, but I think we can resolve these:

           • The linear fuzzy terms enum, from the old code, needs to be fixed to be consistent and use UTF-32 calculations, too.
           • For huge DFAs (e.g. fuzzy) there is some cost to the conversion, around a 5ms one-time cost on my machine for very long strings. Perhaps we can optimize some code here; it's not blowing up, though.

           So in my opinion, the first issue should be resolved before committing, and the second is a nice-to-have and shouldn't block the improvement.

          Michael McCandless added a comment -

          New patch (from Robert & I) – I think this one is ready to commit!

           The rest of FuzzyQuery (e.g. LinearFuzzyTermsEnum) is now cut over to
           code points (not UTF-16 code units), and we've optimized various
           methods in the automaton package (especially det). Performance of
           automaton/fuzzy queries looks on par with or a bit faster than
           trunk.

          Michael McCandless added a comment -

           Latest patch (from Robert!) – strengthens tests, fixes one bug in how the common suffix was created for AutomatonTermsEnum, and cleans up.

          Robert Muir added a comment -

           I think we are good to go here... we should look at getting this in soon, as it will then allow us to cut over to UTF-8 sort order.

          Uwe Schindler added a comment -

          The Generics-Police and Java-5-Police fixed compilation errors/warnings in revision 940743.


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0
              Watchers:
              0
