Lucene - Core
LUCENE-1606

Automaton Query/Filter (scalable regex)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable).

      Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon a constant prefix, and runs the same query in 640 ms.

      Some use cases I envision:
      1. lexicography/etc on large text corpora
      2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

      The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

      The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

      1. Look at the portion that is OK (did not enter a reject state in the DFA)
      2. Generate the next possible String and seek to that.

      The Query simply wraps the Filter with a ConstantScoreQuery.
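      As a toy sketch of that enumerate-and-seek loop (the names below are illustrative, not the patch's actual API): instead of testing every term, a rejected term is followed by a jump to the next string the DFA could possibly accept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy sketch of the enumerate-and-seek idea; names are illustrative, not
// the patch's API. On a rejected term we do not step to the next term: we
// seek to the smallest string the DFA could still accept (which must be
// strictly greater than the rejected term), skipping runs of the dictionary.
class SeekingEnum {
    static List<String> collect(TreeSet<String> terms,
                                Predicate<String> accepts,               // DFA accept test
                                Function<String, String> nextPossible) { // smallest candidate > term, or null
        List<String> hits = new ArrayList<>();
        String t = terms.isEmpty() ? null : terms.first();
        while (t != null) {
            if (accepts.test(t)) {
                hits.add(t);
                t = terms.higher(t);       // accepted: just advance
            } else {
                String target = nextPossible.apply(t);
                t = (target == null) ? null : terms.ceiling(target); // rejected: seek
            }
        }
        return hits;
    }
}
```

      The real patch derives nextPossible by walking the DFA's states; the sketch only shows why seeking lets the filter skip large regions of a 100M-term dictionary.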

      I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

      1. automaton.patch
        19 kB
        Robert Muir
      2. automatonMultiQuery.patch
        34 kB
        Robert Muir
      3. automatonmultiqueryfuzzy.patch
        47 kB
        Robert Muir
      4. automatonMultiQuerySmart.patch
        35 kB
        Robert Muir
      5. automatonWithWildCard.patch
        36 kB
        Robert Muir
      6. automatonWithWildCard2.patch
        36 kB
        Robert Muir
      7. BenchWildcard.java
        4 kB
        Robert Muir
      8. LUCENE-1606_nodep.patch
        194 kB
        Robert Muir
      9. LUCENE-1606.patch
        213 kB
        Robert Muir
      10. LUCENE-1606.patch
        213 kB
        Robert Muir
      11. LUCENE-1606.patch
        214 kB
        Robert Muir
      12. LUCENE-1606.patch
        204 kB
        Robert Muir
      13. LUCENE-1606.patch
        211 kB
        Robert Muir
      14. LUCENE-1606.patch
        211 kB
        Robert Muir
      15. LUCENE-1606.patch
        208 kB
        Robert Muir
      16. LUCENE-1606.patch
        198 kB
        Robert Muir
      17. LUCENE-1606.patch
        198 kB
        Robert Muir
      18. LUCENE-1606.patch
        200 kB
        Uwe Schindler
      19. LUCENE-1606.patch
        199 kB
        Uwe Schindler
      20. LUCENE-1606.patch
        198 kB
        Uwe Schindler
      21. LUCENE-1606.patch
        192 kB
        Robert Muir
      22. LUCENE-1606.patch
        58 kB
        Robert Muir
      23. LUCENE-1606.patch
        47 kB
        Robert Muir
      24. LUCENE-1606-flex.patch
        211 kB
        Uwe Schindler
      25. LUCENE-1606-flex.patch
        216 kB
        Robert Muir
      26. LUCENE-1606-flex.patch
        212 kB
        Robert Muir
      27. LUCENE-1606-flex.patch
        212 kB
        Robert Muir
      28. LUCENE-1606-flex.patch
        213 kB
        Robert Muir
      29. LUCENE-1606-flex.patch
        213 kB
        Robert Muir
      30. LUCENE-1606-flex.patch
        230 kB
        Uwe Schindler
      31. LUCENE-1606-flex.patch
        276 kB
        Uwe Schindler
      32. LUCENE-1606-flex.patch
        276 kB
        Uwe Schindler
      33. LUCENE-1606-flex.patch
        234 kB
        Robert Muir
      34. LUCENE-1606-flex.patch
        212 kB
        Robert Muir
      35. LUCENE-1606-flex.patch
        197 kB
        Michael McCandless

        Issue Links

          Activity

          Robert Muir added a comment -

          Btw, thanks to Uwe, Mike, and Mark for all the help here!

          Robert Muir added a comment -

          Committed revision 888891.

          Robert Muir added a comment -

          Ok, if no one objects I will (heavy) commit this to the flex branch tomorrow.

          The only differences from Uwe's patch will be:

          • ensure the barred-O (ø) is correct in Anders' name for the NOTICE.txt
          • remove the unused instance variable in the enum, as it is irrelevant for FilteredTermsEnum
          Michael McCandless added a comment -

          I think it would be nice to start looking at committing this to flex so we do not have to work with huge patches?

          +1

          Robert Muir added a comment -

          So what do you guys think? I am pretty satisfied with how this enum looks in flex branch myself.

          I think it would be nice to start looking at committing this to flex so we do not have to work with huge patches?

          If there are reservations please speak up, I think the automaton code imported here is solid, and if you run clover you will see our testcases exercise the important bits (not .toString or .toDot <graphviz> or other things, but those work too).

          If you have concerns or think it is confusing, I will do my best to figure out ways to simplify or improve it from here.

          Uwe Schindler added a comment -

          I updated the patch because of my last commit.
          Yours looks good; my change was only adding the method param and removing the access to the now-private tenum.

          Robert Muir added a comment -

          Here I removed the use of String in the enum;
          this seems to help a bit when seeking.

          Instead a char[] is reused, and nextString() etc. return a boolean indicating whether more solutions exist.
          I think it's actually more readable in a way; it needs a bit more reorganizing, but I need a break from this enum.

          Robert Muir added a comment -

          This builds off the last improvement, and uses RunAutomaton (the tableized DFA) for nextString.

          This means parts of nextString are now O(n), where n is the number of states, instead of the number of transitions.
          That doesn't mean much for wildcard, as those DFAs typically do not have many transitions per state, but for more complex DFAs (such as those from regexps) this helps... and it simplifies the code.

          Robert Muir added a comment -

          this is an update to improve performance for lots of seeking (wildcards like ?????NN, crazy regular expressions, fuzzy)

          • expose State.getNumber() as an expert method. this returns the consecutive integer of the state in the automaton.
          • (re)use a bitset for tracking 'visited' instead of creating a hashset for each seek
          • use an array indexed by state number for caching transitions, instead of a hashmap.
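          A minimal sketch of the last two bullets (toy code, not the patch's classes): because State.getNumber() yields consecutive state numbers 0..n-1, a single reused BitSet can replace a per-seek HashSet, and a plain array indexed by state number can replace a HashMap.

```java
import java.util.BitSet;

// Hypothetical sketch, not the patch's classes: shows the two
// allocation-avoidance tricks on a toy automaton whose states are
// numbered 0..numStates-1 (as State.getNumber() would provide).
class SeekScratch {
    final int[][] transitions; // transitions[state] = target states, cached in an array
    final BitSet visited;      // allocated once, cleared per seek (not reallocated)

    SeekScratch(int[][] transitions) {
        this.transitions = transitions;
        this.visited = new BitSet(transitions.length);
    }

    // Count states reachable from 'start'; stands in for one seek's traversal.
    int reachable(int start) {
        visited.clear();
        return walk(start);
    }

    private int walk(int state) {
        if (visited.get(state)) return 0;
        visited.set(state);
        int n = 1;
        for (int next : transitions[state]) n += walk(next);
        return n;
    }
}
```

          The point is purely allocation: wildcards like ?????NN seek constantly, so avoiding a fresh HashSet/HashMap per seek matters.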
          Robert Muir added a comment -

          The current WildcardQuery has getTerm(); this is needed for backwards compatibility. I added it.

          Robert Muir added a comment -

          latest flex patch, after LUCENE-2110 and LUCENE-2121

          Robert Muir added a comment -

          Setting this issue as depending on LUCENE-2111

          This is because I do not feel comfortable committing this code to trunk; it is too complicated there.
          Instead I would like to simply work it against the flex branch, where we can make it nice.

          Robert Muir added a comment -

          Mike I created LUCENE-2121 for this.
          If you get a chance to review it, I can create a new version of the flex branch patch for this issue... this would resolve one of my "big 3 complaints" about complexity of the code.

          Michael McCandless added a comment -

          one thing we could do is put this nextValidUTF16String in UnicodeUtil and also use it in SegmentReader.LegacyTermEnum to replace the "hack", just in case someone else wrote an enum like mine.

          +1

          Robert Muir added a comment -

          btw one thing we could do is put this nextValidUTF16String in UnicodeUtil and also use it in SegmentReader.LegacyTermEnum to replace the "hack", just in case someone else wrote an enum like mine.

          this would provide better backwards compatibility, as they would receive the 'next Term' in IndexReader.terms() just as they did before, even if the term is completely jacked...

          below is the existing hack:

             // this is a hack only for backwards compatibility.
             // previously you could supply a term ending with a lead surrogate,
             // and it would return the next Term.
             // if someone does this, tack on the lowest possible trail surrogate.
             // this emulates the old behavior, and forms "valid UTF-8" unicode.
             if (text.length() > 0 
                && Character.isHighSurrogate(text.charAt(text.length() - 1)))
                  tr = new TermRef(t.text() + "\uDC00");
             else
                  tr = new TermRef(t.text());
          

          instead it could read something like tr = new TermRef(UnicodeUtil.nextValidUTF16String(t.text()));

          Robert Muir added a comment -

          here is an update to the last one, using UnicodeUtil constants, etc.

          So I think I do not absolutely hate the unicode handling code in this enum anymore.

          Robert Muir added a comment -

          In this patch, I take some commented-out code in UnicodeUtil (validUTF16String) and pervert it slightly into nextValidUTF16String.

          all the tests pass using this on trunk and flex, and I think it reads much easier.

          Robert Muir added a comment - - edited

          Here is an explanation of cleanupPosition and the cases it handles.

          Symbols:
          A = \u0000 - \uD7FF (lower BMP)
          H = \uD800 - \uDBFF (high/lead surrogates)
          L = \uDC00 - \uDFFF (low/trail surrogates)
          Z = \uE000 - \uFFFF (upper BMP)

          case 1:
          // an illegal combination, where we have not yet enumerated into the supp. plane
          // so we increment to H + \uDC00 (the lowest possible trail surrogate)
          HA -> H \uDC00
          HH -> H \uDC00
          case 2:
          // an illegal combination where we have already enumerated the supp. plane
          // we must replace both H and Z with \uE000 (the lowest possible "upper BMP")
          HZ -> \uE000
          case 3:
          // an illegal combination where we have a final lead surrogate.
          // we have not yet enumerated the supp plane, so append \uDC00 (lowest possible trail surrogate)
          // this is just like case 1, except in final position.
          H$ -> H \uDC00
          case 4:
          // an unpaired trail surrogate. this is invalid when not preceded by lead surrogate
          // (and if there was one, the above rules would have dealt with it already)
          // in this case we have to bump to \uE000 (the lowest possible "upper BMP")
          unpaired L -> \uE000
          case 5:
          // this is just like case 4, its obviously illegal because the term starts with a trail surrogate.
          // (because it is in initial position)
          ^L -> \uE000

          edit: sorry for the many edits
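          Putting those five cases together, a hypothetical sketch (not UnicodeUtil's actual code) of a nextValidUTF16String-style function could look like:

```java
// Hypothetical sketch of the five cases above, not UnicodeUtil's actual
// code: return the smallest valid-UTF-16 string that is >= s under the
// A/H/L/Z rules described in the comment.
class NextValidUtf16 {
    static String next(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i == s.length() - 1)
                    return s + '\uDC00';                          // case 3: H$ -> H \uDC00
                char d = s.charAt(i + 1);
                if (Character.isLowSurrogate(d)) { i++; continue; } // valid H L pair, skip it
                if (d >= '\uE000')
                    return s.substring(0, i) + '\uE000';          // case 2: HZ -> \uE000
                return s.substring(0, i + 1) + '\uDC00';          // case 1: HA / HH -> H \uDC00
            }
            if (Character.isLowSurrogate(c))
                return s.substring(0, i) + '\uE000';              // cases 4/5: unpaired or leading L -> \uE000
        }
        return s; // already valid UTF-16
    }
}
```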

          Robert Muir added a comment -

          yes, Mark you have it right. This is not an issue for trunk, only flex, but I fixed it ahead of time to prevent problems.

          So I have a Chinese example here: https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12785034&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12785034

          But at the same time, this kind of thing can happen with a simple wildcard * or ?.
          Because a wildcard ? is converted into a UTF-16 DFA transition over \u0000-\uFFFF, when enumerating this transition (depending on your term dictionary) we cannot create a nextString that ends with, say, \uD800, or else it will get replaced with \uFFFD and we will miss terms.

          The enumeration in nextString is greedy, so we can't just look for these in final position either. For example, if you have a regexp of [lm]ar[kl], this enum first tries to seek to lark (it will continue to walk transitions until an accept state or a loop).
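          That greedy walk can be sketched on a toy DFA (hypothetical code, not the real enum): from each state, follow the smallest outgoing label until an accept state is hit, which yields the first seek target - "lark" for [lm]ar[kl].

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy, not the real enum's code. (No loop detection here;
// per the comment above, the real walk also stops when it hits a loop.)
class GreedyFirstString {
    // dfa.get(state) maps label -> next state; TreeMap keeps labels sorted,
    // so firstEntry() is always the smallest outgoing transition.
    static String firstAccepted(Map<Integer, TreeMap<Character, Integer>> dfa,
                                int start, Set<Integer> accept) {
        StringBuilder sb = new StringBuilder();
        int state = start;
        while (!accept.contains(state)) {
            Map.Entry<Character, Integer> smallest = dfa.get(state).firstEntry();
            if (smallest == null) return null; // dead end, nothing accepted this way
            sb.append(smallest.getKey().charValue());
            state = smallest.getValue();
        }
        return sb.toString();
    }
}
```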

          Mark Miller added a comment - - edited

          Sorry - haven't been paying a lot of attention to all of the Unicode issues/talk lately.

          Could you briefly explain cleanupPosition? What's the case where a seek position cannot be converted to UTF-8?

          edit

          Because of next string guesses that might not be valid UTF-8?

          Robert Muir added a comment -

          Also going over the code, but that's going to take more time.

          Btw, I will accept any criticism here. I am not happy with the complexity of the enum in the trunk patch, personally.
          But here are the three main issues that I think make it complex: (not to try to place blame elsewhere)

          • This trie<->DFA intersection is inherently something I would want to define recursively, but that would obviously be bad.
          • The DFA library uses UTF-16 whereas TermRef requires UTF-8. Changing automaton to use 'int' would fix this, but then would destroy performance. The reason brics is the fastest java regex library is that it tableizes the DFA into a 64k UTF-16 char[]. See RunAutomaton for the impl. I think making this require 1MB for the corner cases is bad.
          • MultiTermQuerys that seek around are pretty complex in trunk. In my opinion this enum is a lot easier to understand with the improvements Uwe is working on for FilteredTermsEnum (see his branch patch, I think its easier there).

          If you have ideas how we can simplify any of this in trunk for easier readability (instead of just adding absurd amounts of comments, as I did), I'd be very interested.
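          To illustrate the tableization point above (a toy sketch, not brics' actual RunAutomaton): a char -> character-class table is precomputed once over all 64k UTF-16 chars, so running the DFA is just two array lookups per character.

```java
import java.util.Arrays;

// Toy sketch of the tableization idea, not brics' real RunAutomaton.
// Toy DFA for the regexp ab*: state 0 --a--> state 1 (accept), 'b' loops.
class TinyRunAutomaton {
    final byte[] classOf = new byte[65536]; // char -> character class, the 64k table
    final int[][] table;                    // table[state][class] -> next state, -1 = reject

    TinyRunAutomaton() {
        Arrays.fill(classOf, (byte) 2);     // class 2: every other char
        classOf['a'] = 0;
        classOf['b'] = 1;
        table = new int[][] {
            {  1, -1, -1 },                 // state 0: only 'a' advances
            { -1,  1, -1 },                 // state 1 (accept): 'b' loops
        };
    }

    boolean run(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            state = table[state][classOf[s.charAt(i)]];
            if (state == -1) return false;  // entered a reject state
        }
        return state == 1;                  // state 1 is the accept state
    }
}
```

          This is also why switching the library to 'int' code points would hurt: the class table would have to cover the whole code-point range instead of 64k chars.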

          Mark Miller added a comment -

          Yeah I tried to do some of this in a very quick way if you look at the tests... I generate some random wildcard/regexp queries (mainly to prevent bugs from being introduced).

          Yeah, I think the tests are pretty solid (from the brief looks I've had thus far) - this is mainly just a precaution - so that we are not surprised by a more realistic corpus. And to have the opportunity to compare with the old WildcardQuery - I'd rather not keep it around for tests - once we are confident it's the same (and I am at this point), I'm happy to see it fade into the night. Replacing such a core piece, though, I want to be absolutely sure everything is on the level.

          they are definitely contrived but I think cover the bases for any unicode problems.

          Right - in terms of unit tests, I think you've done great based on what I've seen. This is just throwing more variety at a larger more realistic corpus. More of a one time deal than something that should be incorporated into the tests. Ensures there are no surprises for me - since I didn't write any of this code (and I'm not yet super familiar with it), it helps with my comfort level

          One problem is that none of this unicode stuff is ever a problem on trunk!

          Yeah - I assumed not. But as I'm not that familiar with the automaton stuff yet, I wanted to be sure there wasn't going to be any input that somehow confused it. I realize that your familiarity level probably tells you that's not possible - but mine puts me in the position of testing anyway - else I'll look like a moron when I +1 this thing

          If you save this test setup,

          I'll save it for sure.

          Robert Muir added a comment -

          Mark oh ok, well thanks for spending so much time here testing and reviewing.

          I really just wanted to make sure every random query acts the same with both impls and that no random input can somehow screw things up (I'm using commons-lang to pump in random unicode strings, along with turning the dict entries into wildcards that are more likely to get many hits).

          Yeah I tried to do some of this in a very quick way if you look at the tests... I generate some random wildcard/regexp queries (mainly to prevent bugs from being introduced).

          The unicode tests (TestAutomatonUnicode) took me quite some time, they are definitely contrived but I think cover the bases for any unicode problems.
          One problem is that none of this unicode stuff is ever a problem on trunk!

          If you save this test setup, maybe in the future I can trick you into running your tests on flex, where the unicode handling matters (as TermRef must be valid UTF-8 there).

          Mark Miller added a comment -

          And definitely correctness....

          Right - that's my main motivation - comparing the results of the old WildcardQuery with the new - I actually put the timings in there as an afterthought - just because I was curious.

          I really just wanted to make sure every random query acts the same with both impls and that no random input can somehow screw things up (I'm using commons-lang to pump in random unicode strings, along with turning the dict entries into wildcards that are more likely to get many hits).

          Didn't expect to find anything, but it will make me feel better about +1ing the commit

          Also going over the code, but that's going to take more time.

          Robert Muir added a comment -

          Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

          Oh ok, I didn't know. Because my benchmark, as Mike said, is definitely very "contrived".

          But its kind of realistic, there are situations where the number of terms compared to the number of docs is much higher (maybe even 1-1 for unique product ids and things like that).

          I am glad you did this test, because I was concerned about the "small index" case too. And definitely correctness....

          I think you are right about the partial dump. I am indexing the full dump now (at least I think). I will look at it too, at least for curiosity's sake.

          Mark Miller added a comment - - edited

          I think thats pretty small

          Okay, fair enough. Guess it depends on your idea of small - though I would have guessed (wrongly, it appears) that it would be more. One diff is that I think the benchmark uses a 200mb (zipped) or so dump by default? I'm using a 5 gig dump - though that prob doesn't add too many more in the scheme of things.

          More interesting to see the benefits...

          Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

          Robert Muir added a comment -

          I'm not sure at the moment - but it's wikipedia dumps, so I'd guess it's rather high actually.

          I looked at the wikipedia dump in benchmark (when indexed with StandardAnalyzer), body only has 65k terms... I think that's pretty small.
          I do not think automaton will help much with such a small number of terms; it's definitely a worst-case benchmark you are performing.
          I think very little time is probably spent here in term enumeration, so scalability does not matter for that corpus.

          More interesting for seeing the benefits would be something like indexing geonames data (lots of terms), or even that (much smaller) Persian corpus I mentioned with nearly 500k terms...

          Robert Muir added a comment -

          I'm not sure at the moment - but it's wikipedia dumps, so I'd guess it's rather high actually.

          See the description: I created this mainly for running regexps on indexes with 100M+ unique terms.
          Wildcard doesn't get as much benefit, except the ? operator and the comparisons being faster (array-based DFA).

          I'm pleased to hear it's doing so well on such a "small" index as wikipedia, as I would think automaton overhead would make it slower (although this can probably be optimized away)

          Uwe Schindler added a comment -

          again - krr to the hell with the AM/PM bug in JIRA! It is ****xxx**

          Mark Miller added a comment -

          how many unique terms are in the field you are testing

          I'm not sure at the moment - but it's wikipedia dumps, so I'd guess it's rather high actually. It is hitting the StandardAnalyzer going in (mainly because I didn't think about changing it when building the indexes). And the queries are getting hit with the lowercase filter (stole the code anyway).

          Uwe Schindler added a comment -

          Here is the patch with the getEnum/getTermsEnum changes instead of rewrite but with reverted LUCENE-2110, which was stupid.

          Robert Muir added a comment -

          Mark, thanks for testing!

          Yes, the new wildcard should really only help for ? with trunk (especially leading ?).
          With flex it should help a lot more; even leading * gets the benefit of "common suffix" and byte[] comparison and things like that.
          This code is in the trunk patch but does not really help yet because the trunk enum works on String.

          btw, how many unique terms are in the field you are testing... this is where it starts to help with ?, when you have a ton of unique terms.
          But I am glad you are testing with hopefully a smaller # of unique terms - this is probably more common.

          Mark Miller added a comment - - edited

          The new WildcardQuery is holding up very well under random testing -

          I'm comparing the results of the old WildcardQuery impl with the new WildcardQuery impl.

          I'm using a 2 million doc english and 2 million doc french index. (wikipedia dumps)

          Generating random queries - both random short strings built up from random unicode chars mixed with some random wildcards, and random english/french words from dictionaries, randomly chopped or not, with random wildcards injected. A whole lot of crazy randomness.

          They have always produced the same number of results so far (a few hours of running).

          The new impl is generally either a bit faster in these cases, or about the same - at worst (in general), I've seen it about .01s slower. When it's faster, it's often > .1s faster (or more when a few '?' are involved).

          On avg, I'd say the perf is about the same - where the new impl shines appears to be when '?' is used (as I think Robert has mentioned).

          So far I haven't seen any anomalies in time taken or anything of that nature.
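
          Mark's random-query harness isn't attached to the issue, but the generation strategy he describes (chop dictionary words, inject wildcards) can be sketched roughly like the following. This is a hypothetical illustration, not the actual test code; all names are mine:

          ```python
          import random

          def random_wildcard_query(rng, words, wildcards="*?"):
              # Hypothetical sketch of the test-query generation described above:
              # pick a dictionary word, randomly chop it, then inject a few
              # wildcard characters at random positions.
              w = rng.choice(words)
              if rng.random() < 0.5:                 # randomly chopped or not
                  w = w[: rng.randrange(1, len(w) + 1)]
              for _ in range(rng.randrange(0, 3)):   # inject 0-2 random wildcards
                  i = rng.randrange(len(w) + 1)
                  w = w[:i] + rng.choice(wildcards) + w[i:]
              return w
          ```

          Feeding the same generated query to both implementations and comparing hit counts is then a cheap equivalence check.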

          Uwe Schindler added a comment -

          Now the final one.

          I somehow need a test enum which does very strange things like seeking forward and backwards and returning all kinds of strange statuses.

          Will think about one tomorrow.

          Uwe Schindler added a comment -

          Stop everything, I'm collapsing!!!!! Again a wrong patch.

          Uwe Schindler added a comment -

          There was a bug in the patch before, sorry. I will finish work for today, I am exhausted like the enums.

          Uwe Schindler added a comment -

          An update with the changed nextSeekTerm() semantics from LUCENE-2110.

          Robert: Can you test performance again and compare with old?

          Robert Muir added a comment -

          Hi Uwe, I ran my benchmarks, and with your patch the performance is the same.

          But the code is much simpler and easier to read... great work.

          Uwe Schindler added a comment -

          New patch - there was a lost private field. Also changed the nextSeekTerm method to be more straightforward.

          Robert: Sorry, it would be better to test this one g

          Robert Muir added a comment -

          Robert: Can you do performance tests with the old and new flex patch, I do not want to commit 2110 before.

          Uwe, I will run a benchmark on both versions!

          Uwe Schindler added a comment -

          Here is a flex patch for automaton. It contains LUCENE-2110; as soon as 2110 is committed I will upload a new patch. But it's hard to differentiate between all the modified files.

          Robert: Can you do performance tests with the old and new flex patch, I do not want to commit 2110 before.

          Robert Muir added a comment -

          btw, this patch is a bit different from the last port to flex in one way.
          Like the trunk patch, the commonSuffix is only computed for "linear mode", aka slow queries.
          But computing this for flex will be a win even for faster queries in "smart mode", because it can dodge more unicode conversion with TermRef byte[] comparison.

          The problem is my implementation of getCommonSuffix() is a little crappy: reverse the entire automaton, redeterminize, take its common prefix, then reverse that.
          Instead, improving reverse() so that it keeps determinism when it's already a DFA (DFA->DFA) will allow this commonSuffix to be used in both modes without any concern that it will ever hurt performance.
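
          The reverse/take-common-prefix/reverse-back trick is easier to see on a plain set of strings than on an automaton. A minimal sketch of the idea (an illustration, not the brics-based Lucene code):

          ```python
          import os

          def common_prefix(strings):
              # longest prefix shared by every string in the set
              return os.path.commonprefix(list(strings))

          def common_suffix(strings):
              # the trick described above: reverse each string, take the common
              # prefix of the reversed set, then reverse the result back
              return common_prefix([s[::-1] for s in strings])[::-1]
          ```

          The automaton version does the same thing, except "reverse each string" becomes reversing the DFA (which may require redeterminizing), which is why a determinism-preserving reverse() would make it cheap enough to use everywhere.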

          Robert Muir added a comment -

          Attached is a port of the latest trunk patch to the flex branch, for experimenting or whatever.

          Mark Miller added a comment -

          I'll play with it some for one. Fantastic commenting man - this whole patch is pretty darn thorough.

          Robert Muir added a comment -

          I added random testing for wildcards and regexps.
          Don't know what else needs to be done here, please review if you can.

          Robert Muir added a comment - - edited

          edit: edited out my Chinese chars and replaced them with <chineseCharOutsideBMP> as there are some problems indexing this comment.

          btw, the unicode complexity I mention here is not just me being anal; it's an impedance mismatch between the automaton library using UTF-16 code-unit representation and the new flex API requiring valid UTF-8.

          I am not trying to introduce complexity for no reason, here is an example:

          RegExp re = new RegExp("(<chineseCharOutsideBMP>|<chineseCharOutsideBMP>)");
          System.out.println(re.toAutomaton());
          
          which prints:
          
          initial state: 0
          state 0 [reject]:
            \ud866 -> 2
          state 1 [accept]:
          state 2 [reject]:
            \udf05-\udf06 -> 1
          

          As you can see, the automaton library handles these characters correctly, but as code units.
          So it's important not to seek to invalid locations when walking through the DFA, because these will be replaced by U+FFFD,
          and terms could be skipped, or we could go backwards, creating a loop.
          That's why I spent so much time on this.
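
          The mismatch is easy to reproduce outside Lucene. In Python, for example, a lone UTF-16 code unit like the \ud866 in the automaton dump above simply cannot be serialized as UTF-8, while the complete surrogate pair can (an illustration of the constraint, not Lucene code):

          ```python
          def utf8_encodable(s):
              # a lone surrogate code unit is not a valid Unicode scalar value,
              # so it cannot be serialized as well-formed UTF-8
              try:
                  s.encode("utf-8")
                  return True
              except UnicodeEncodeError:
                  return False
          ```

          A seek position that cuts an automaton path between the high and low surrogate is exactly this invalid case, which is why it has to be cleaned up before handing it to the terms enum.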

          Robert Muir added a comment -

          Here is an updated patch. In my opinion all that is needed is to add the random testing and code reorganization (but no algorithmic/feature changes), maybe better docs too.

          This patch has:

          • more tests, especially surrogate handling
          • some more automaton paring
          • the seek positions and common suffix are always valid UTF-8 strings
          • backtracking is aware of U+FFFF, so terms can have it.

          This patch works correctly on both trunk and flex, although I don't have an included test for U+FFFF since I can't put it in a trunk index.

          I'm not too terribly happy about the way the unicode handling works here; it's correct but could be coded better.
          The code I wrote is not the easiest to read, and suggestions are welcome.
          But it is all localized to one method, cleanupPosition(). This is defined as:

          * if the seek position cannot be converted to valid UTF-8,
          * then return the next valid String (in UTF-16 sort order) that
          * can be converted to valid UTF-8.
          

          if you have ideas on how to make this nicer I am happy to hear them.
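
          One hypothetical reading of that contract over UTF-16 code units (function and variable names are mine, not the patch's): the only way a UTF-16 string fails to convert to UTF-8 is an unpaired surrogate, and the next convertible string in UTF-16 order can be computed locally at the first such position.

          ```python
          def is_high(c):
              return 0xD800 <= ord(c) <= 0xDBFF   # high (lead) surrogate

          def is_low(c):
              return 0xDC00 <= ord(c) <= 0xDFFF   # low (trail) surrogate

          def cleanup_position(s):
              # Hypothetical sketch of the contract quoted above: if s (a UTF-16
              # seek position) contains an unpaired surrogate, return the next
              # string in UTF-16 sort order that converts to valid UTF-8.
              i = 0
              while i < len(s):
                  if is_high(s[i]):
                      if i + 1 < len(s) and is_low(s[i + 1]):
                          i += 2          # well-formed surrogate pair, keep going
                          continue
                      # unpaired high surrogate: the smallest valid continuation
                      # completes the pair with the lowest low surrogate, U+DC00
                      return s[: i + 1] + "\udc00"
                  if is_low(s[i]):
                      # unpaired low surrogate: every string keeping this prefix
                      # stays invalid; bump to the first non-surrogate above it
                      return s[:i] + "\ue000"
                  i += 1
              return s                    # already convertible to valid UTF-8
          ```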

          Robert Muir added a comment -

          Thanks Mike, I will change the enum to reflect this.
          Currently I cheat and take advantage of this property (in trunk) to make the code simpler.

          Michael McCandless added a comment -

          My understanding is that now \uFFFF can be in the index, and I can seek to it (it won't get replaced with \uFFFD).

          Yes, \uFFFF should be untouched now (though I haven't verified – actually I'll go add it to the test we already have for \uFFFF).

          Robert Muir added a comment -

          Mike,

          This is ok for the trunk, but I have a question about \uFFFF in flex (I guess we do not need to figure it out now, just think about it).
          My understanding is that now \uFFFF can be in the index, and I can seek to it (it won't get replaced with \uFFFD).
          From your comment this seems undefined at the moment, but for this enum I need to know, otherwise it will either skip \uFFFF terms, or go into a loop.

          Robert Muir added a comment - - edited

          OK let's keep it for now? But somehow we need to remember when this gets onto flex branch to put it back...

          I guess what I am saying is I think the latest patch, which uses the isFinite() property of the DFA to determine whether or not to seek itself versus trying next(), is the best for both trunk and flex, but for different reasons?

          edit:
          the different reasons being: seek is expensive in trunk, because of the SegmentTermEnum clone()
          seek is "expensive" in flex, because doing a seek when we are in a loop entails unicode conversion, but next() avoids this with TermRef comparison.

          It gives the best of all the scores from https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12782198&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12782198

          In the future this could be refined, such that whether to try the extra next() or go ahead and seek() could instead be driven by whether or not we are actually ping-ponging against a loop (i.e. against a wildcard *), rather than computed up-front for the entire query. This can be determined from the state/transitions of the path being evaluated, but it's not a one-liner!

          Michael McCandless added a comment -

          Mike by the way, I profiled the seeking on trunk, right at the top with 20% in hprof is the SegmentTermEnum clone... this is why at least for now, and on flex for different reasons, I think we should keep this stupid heuristic.

          OK let's keep it for now? But somehow we need to remember when this gets onto flex branch to put it back...

          Robert Muir added a comment -

          Mike by the way, I profiled the seeking on trunk, right at the top with 20% in hprof is the SegmentTermEnum clone... this is why at least for now, and on flex for different reasons, I think we should keep this stupid heuristic.

          But it's improved now, because at least it's using some knowledge of the DFA (whether or not it contains loops) to make this determination - thanks for the idea!

          Michael McCandless added a comment -

          Make that "when to seek itself vs next() Lucene"

          Michael McCandless added a comment -

          But if I were to do this, then that would kill the TermRef comparison speedup, because then no matter how much i optimize "my seek" nextString(), it needs to do the unicode conversion, which we have seen is expensive across many terms.

          Right, I think ATE must still pick & choose when to seek itself vs seek Lucene, based on how costly nextString() is...

          Robert Muir added a comment -

          But I don't think this'll impact your tests with a large suffix since each seek will jump way ahead to a new index block.

          Mike, if you add that optimization, that takes care of Lucene itself - it's smart enough to turn a seek into a read when it should - so you might say I should simplify my code and just always seek.
          But if I were to do this, then that would kill the TermRef comparison speedup, because then no matter how much I optimize "my seek" nextString(), it needs to do the unicode conversion, which we have seen is expensive across many terms.

          Michael McCandless added a comment -

          (Doing next() at that point is most likely a waste, and anyway the enum will turn your seek into a next if it's "close")

          Actually – I just remembered – flex branch is failing to do this optimization (there's already a nocommit reminding us to do it). Ie, it's always doing the binary search through the indexed terms... and not doing a scan when it determines the term you're seeking to is within the same index block.

          But I don't think this'll impact your tests with a large suffix since each seek will jump way ahead to a new index block.

          Robert Muir added a comment -

          Sorry, wrong file. Getting lost in iterations of this patch.

          Robert Muir added a comment -

          In this patch, if the automaton is finite, always seek.
          If it's infinite, keep reading terms sequentially until a term fails (then seek).

          It seems to be the best of both worlds, and makes perfect sense to me.

          The only thing that has me nervous is that SpecialOperations.isFinite() is defined recursively... will have to look into maybe writing this method iteratively, in case someone builds some monster automaton from a 2-page regexp or something like that.
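
          For the finite case, "always seek" amounts to enumerating the (finite) accepted language in sorted order and seeking the term dictionary to each candidate, rather than testing every term. A toy sketch, with a sorted list standing in for the TermEnum and names that are hypothetical, not from the patch:

          ```python
          import bisect

          def match_terms_finite(terms, accepted):
              # terms: the sorted term dictionary; accepted: the finite language
              # of the automaton, also sorted. Seek directly to each candidate
              # (the next possible string) instead of scanning every term.
              hits = []
              ti = 0
              for candidate in accepted:
                  ti = bisect.bisect_left(terms, candidate, ti)   # the "seek"
                  if ti == len(terms):
                      break                  # past the end of the dictionary
                  if terms[ti] == candidate:
                      hits.append(candidate)
              return hits
          ```

          The infinite case differs only in the control flow: after a hit, keep calling next() on the enum and testing, and fall back to a seek only once a term is rejected.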

          Robert Muir added a comment - - edited

          Benchmark results from Mike's idea. I don't use any heuristic, just remove the extra 'next' to show the tradeoffs.

          disclaimer: against trunk with LUCENE-2075

          Pattern Iter AvgHits AvgMS AvgMS (noNext)
          N?N?N?N 10 1000.0 37.5 28.4
          ?NNNNNN 10 10.0 6.4 6.1
          ??NNNNN 10 100.0 9.6 9.2
          ???NNNN 10 1000.0 52.7 40.9
          ????NNN 10 10000.0 300.7 262.3
          NN??NNN 10 100.0 4.9 4.1
          NN?N* 10 10000.0 9.6 28.9
          ?NN* 10 100000.0 80.4 235.4
          *N 10 1000000.0 3811.6 3747.5
          *NNNNNN 10 10.0 2098.3 2221.9
          NNNNN?? 10 100.0 2.2 2.4

          Mike, my gut feeling, which will require a lot more testing, is that if the automaton accepts a finite language (in the wildcard case, no *), we should not do the next() call.
          But more benchmarking is needed, with more patterns, especially on the flex branch, to determine if this heuristic is best.

          Robert Muir added a comment -

          But the fixed trailing prefix case I think should be all seeks.

          I'll use regular expressions here, just to elaborate on this.

          what if it's ab.[ab]? but then ab.[a-z]? ... where do you draw the line?

          It's also worth mentioning that the automaton "seek", nextString(), is a lot more heavyweight right now than its "compare", which is extremely fast: it's the DFA in tableized (array) form, just an array lookup.
          This is why it beats even the hairy wildcard code we had before (after Uwe fixed my bug, of course).

          I think there are heuristics like you say we can do, and there's a lot of knowledge in the DFA we can use to implement these for optimal behavior.
          I think we can also improve the code itself, so that "seek", the nextString() method itself, is more lightweight.

          On the other hand, the big unknown is the distribution of the term dictionary itself.

          I did a very basic implementation here; I'm hoping we can come up with better ideas that work well on average.
          One problem is: what is an "average" regular expression or wildcard query?

          Michael McCandless added a comment -

          pretend ab* isn't rewritten to a prefixquery (it is, but there are more complex examples like this that cannot be)
          is it faster to seek 1M times and get the 1M terms, or just read them sequentially?

          This case should definitely be done sequentially.

          But the fixed trailing prefix case I think should be all seeks.

          That said, seeking on trunk is a lot more costly than seeking on flex, because trunk has to make a new [cloned] SegmentTermEnum for each seek.

          I think that might be what's killing me, when I ran tests on Lucene 2.9 or whatever.
          We should retest performance on flex.

          Well, the seeks need to be done anyway... so you can't work around that. The only question is whether a wasted next() was done before each, I guess...

          Robert Muir added a comment -

          I guess here is the big question, Mike: pretend ab* isn't rewritten to a PrefixQuery (it is, but there are more complex examples like this that cannot be).

          is it faster to seek 1M times and get the 1M terms, or just read them sequentially?
          Furthermore, the "seek" is not just a Lucene seek; I have to walk the transitions and compute the next place to go... (and create a few objects along the way)

          Robert Muir added a comment -

          But, take the abcd*1234 case - you first seek to abcd, the terms enum finds (say) abcdX1234 - don't you (DFA) know at this point that next possible candidate is abcdY1234? Ie, you should seek to that term? (Doing next() at that point is most likely a waste, and anyway the enum will turn your seek into a next if it's "close").

          I'm pretty sure I know this information, but in general I only seek on mismatches, I think for the reason that there can be a lot of seeks for ab* (say 1 million terms).
          So instead I wait for a mismatch until I seek again; I think tests showed this was due to what you mentioned below?

          That said, seeking on trunk is a lot more costly than seeking on flex, because trunk has to make a new [cloned] SegmentTermEnum for each seek.

          I think that might be what's killing me, when I ran tests on Lucene 2.9 or whatever.
          We should retest performance on flex.

          This is why I said significant rework of this should maybe take place in flex (although I still think this is an improvement for trunk already), to fully take advantage of it.

          Michael McCandless added a comment -

          But, take the abcd*1234 case – you first seek to abcd, the terms enum finds (say) abcdX1234 – don't you (DFA) know at this point that next possible candidate is abcdY1234? Ie, you should seek to that term? (Doing next() at that point is most likely a waste, and anyway the enum will turn your seek into a next if it's "close").

          That said, seeking on trunk is a lot more costly than seeking on flex, because trunk has to make a new [cloned] SegmentTermEnum for each seek.

          Robert Muir added a comment -

          Mike, here is a more complex example of the ping-pong:

          a wildcard of abcd*12?4

          It has to seek to abcd first, because of the * (the loop stops me).
          The ping-pong from the term dictionary returns, say, abcdk1000.
          It moves me past the giant monster.
          Now I know enough to seek to abcdk12\u00004.

          In this case it's nice: the TermEnum moved me past it in one seek.
          Sometimes it's not so nice; if the TermEnum gave me abcda, I'm not past the monster.

          All I can do is generate the next possible term that won't put me into a DFA reject state, which is abcda\u0000... forcing the enum to move me forwards.
          Maybe I seek to abcda\u0000, and it gives me abcdaa back; I'll do the same thing again.

          The reason is, somewhere down the line there could be abcdaaaaaaaaaaaaaaaaaaaaaaaaaa1234.

          Robert Muir added a comment -

          I think I'm confused - if the query is ???1234, the common suffix is
          1234, and so shouldn't the DFA tell us the next XXX1234 term to try to
          seek to (and we should never use next() on the enum)?

          Not really the suffix, but more general: the structure of the DFA itself tells us that for ???1234, if you are evaluating 5551234, the next term should really be 5561234.
          You can tell this by basically 'walking the graph'.

          But this knowledge is not always available; sometimes we only have 'partial' knowledge of where to go next.
          The culprit here is the * operator, because it eats up anything.
          So when you walk the graph, the * operator is this giant monster in your way.

          So sometimes, depending on the DFA (which depends on the wildcard or regex used to construct it),
          automaton knows enough to seek to all the exact locations, if it's finite (like ???1234).

          Sometimes it only knows partial information. When a loop is encountered walking the graph (the giant monster), it has to stop and only use the path information it knows so far.
          For example, for a wildcard of abcd*1234, the best it can do is seek to abcd.

          Your description of ping-pong is right; this is how these situations are handled.

          Right now, the way the enum works is I don't even bother seeking until I hit a mismatch.
          You can see this in the comments: 'as long as there is a match, keep reading. this is an optimization when many terms are matched sequentially like ab*'.

          I tested this a long time ago; perhaps we should re-test to see if it's still appropriate?
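          The loop described above can be sketched against a toy term dictionary. WildcardSeek below is a hypothetical stand-in for the real DFA machinery: it only handles fixed-length patterns of literals and '?', and computes the smallest matching string at or above a given term, which is exactly the seek target this discussion is about:

```java
import java.util.*;

final class WildcardSeek {
    private final String pattern; // literals and '?', fixed length (no '*')

    WildcardSeek(String pattern) { this.pattern = pattern; }

    boolean matches(String s) {
        if (s.length() != pattern.length()) return false;
        for (int i = 0; i < s.length(); i++) {
            char pc = pattern.charAt(i);
            if (pc != '?' && pc != s.charAt(i)) return false;
        }
        return true;
    }

    /** Smallest string matching the pattern that is >= s, or null if none. */
    String nextString(String s) {
        int len = pattern.length();
        char[] t = new char[len];
        for (int i = 0; i < len; i++) {
            if (i >= s.length()) return fillMin(t, i);   // longer than s: already greater
            char pc = pattern.charAt(i), sc = s.charAt(i);
            char c = (pc == '?') ? sc : pc;
            if (c > sc) { t[i] = c; return fillMin(t, i + 1); }
            if (c < sc) return bumpAndFill(t, s, i);     // must raise an earlier '?'
            t[i] = c;                                    // still tracking s exactly
        }
        return s.length() <= len ? new String(t) : bumpAndFill(t, s, len);
    }

    private String fillMin(char[] t, int from) {         // cheapest completion
        for (int i = from; i < t.length; i++) {
            t[i] = pattern.charAt(i) == '?' ? '\u0000' : pattern.charAt(i);
        }
        return new String(t);
    }

    private String bumpAndFill(char[] t, String s, int i) { // raise the last raisable '?'
        for (int j = i - 1; j >= 0; j--) {
            if (pattern.charAt(j) == '?' && s.charAt(j) < '\uffff') {
                t[j] = (char) (s.charAt(j) + 1);
                return fillMin(t, j + 1);
            }
        }
        return null;                                     // nothing above s matches
    }

    /** The enum loop: next() while matching, seek (TreeSet.ceiling) on mismatch. */
    static List<String> collect(NavigableSet<String> terms, WildcardSeek w) {
        List<String> hits = new ArrayList<>();
        String target = w.nextString("");
        String term = target == null ? null : terms.ceiling(target); // initial seek
        while (term != null) {
            if (w.matches(term)) {
                hits.add(term);
                term = terms.higher(term);               // sequential next()
            } else {
                target = w.nextString(term);
                if (target == null) break;               // automaton exhausted
                term = terms.ceiling(target);            // seek past the gap
            }
        }
        return hits;
    }
}
```

          For the terms {aac, abc, abd, acc, azc, bzz} and pattern a?c, the loop seeks to aac, reads abc sequentially, hits the mismatch at abd, computes nextString("abd") = "acc" and seeks straight there; at bzz, nextString() returns null and the enumeration ends.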

          Michael McCandless added a comment -

          Responding from LUCENE-2075...

          So I guess if there's a non-empty common suffix you should just always seek?

          the suffix is just for quick comparison, not used at all in seeking.

          I think I'm confused – if the query is ???1234, the common suffix is
          1234, and so shouldn't the DFA tell us the next XXX1234 term to try to
          seek to (and we should never use next() on the enum)?

          Robert Muir added a comment -

          This patch removes the constant prefix, as it's only used in dumb mode, and in dumb mode there is no constant prefix.
          Instead it's replaced with a constant suffix, which speeds up comparisons.

          The constant suffix is implemented naively as reversing the DFA, taking its constant prefix, then reversing that.
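          The reverse-prefix-reverse trick is easy to see on a finite language written out as plain strings; this helper is only an illustration of the idea (the patch does the analogous operation on the automaton itself):

```java
import java.util.List;

final class SuffixDemo {
    /** Common suffix = reverse(common prefix of the reversed strings). */
    static String commonSuffix(List<String> language) {
        String prefix = new StringBuilder(language.get(0)).reverse().toString();
        for (String s : language) {
            String rev = new StringBuilder(s).reverse().toString();
            int i = 0;
            while (i < Math.min(prefix.length(), rev.length())
                    && prefix.charAt(i) == rev.charAt(i)) {
                i++;                          // extend the shared reversed prefix
            }
            prefix = prefix.substring(0, i);  // shrink to what all strings share
        }
        return new StringBuilder(prefix).reverse().toString();
    }
}
```

          For example, commonSuffix(List.of("abc1234", "xyz1234", "q34")) yields "34", which can then reject candidate terms cheaply before consulting the DFA.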

          Uwe Schindler added a comment -

          yes.

          Robert Muir added a comment -

          And I could still use this with "dumb mode", just one enum, right?

          Uwe Schindler added a comment -

          I think the approach with nextEnum() would work for Automaton and NRQ, because both use this iteration approach. You have nextString() for repositioning, and I have a LinkedList (a stack) of pre-sorted range bounds.

          Robert Muir added a comment -

          Yeah, but in general I think I already agree that FilteredTerm*s*Enum is easier for stuff like this.

          Either way, it's still tricky to make enums like this, so I am glad you are looking into it.

          Uwe Schindler added a comment -

          OK, so it doesn't affect NRQ, as it uses a different algorithm

          Robert Muir added a comment -

          No, the main thing he did here that I like better is that instead of caching the last comparison in termCompare(), he uses a boolean 'first'.

          This still gives the optimization of 'don't seek in the term dictionary unless you get a mismatch; as long as you have matches, read sequentially'.
          But in my opinion, it's cleaner.

          Uwe Schindler added a comment - - edited

          Did he change the FilteredTermEnum.next() loops? If yes, maybe the better approach also works for NRQ. I am just interested, but had no time to thoroughly look into the latest changes.

          I am still thinking about an extension of FilteredTermEnum that supports this repositioning out of the box. But I have no good idea. The work in FilteredTerm*s*Enum is a good start, but it may be extended to also support something like a return value "JUMP_TO_NEXT_ENUM" and an abstract method "nextEnum()" that returns null by default (no further enum).

          Robert Muir added a comment -

          Patch line 6025.

          Thanks for reviewing the patch and catching this. I'm working on trying to finalize this.
          It already works fine for trunk, but I don't want it to suddenly break with the flex branch, so I'm adding a lot of tests and improvements in that regard.
          The current wildcard tests aren't sufficient anyway to tell if it's really working.
          Also, when Mike ported it to the flex branch, he reorganized some code in a way that I think is better, so I want to tie that in too.

          Robert Muir added a comment - - edited

          about the cleanupPrefix method: it is only used in the linear case to initially set the termenum. What happens if the nextString() method returns such a string used to seek the next enum?

          Look at the code of nextString() itself: it uses cleanupPosition(), which works differently.

          When seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate.
          When using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc.; it cannot be converted into UTF-8 bytes. (And we can't use the \uDC00 trick, obviously, or startsWith() would return false when it should not.)

          Uwe Schindler added a comment -

          about the cleanupPrefix method: it is only used in the linear case to initially set the termenum. What happens if the nextString() method returns such a string used to seek the next enum?

          Uwe Schindler added a comment -

          Uwe, where do you see UTF-38

          Patch line 6025.

          Robert Muir added a comment -

          Uwe, where do you see UTF-38

          Robert Muir added a comment - - edited

          I think there is one last problem with this for the flex branch: where you have abacadaba\uFFFC, abacadaba\uFFFD and abacadaba\uFFFE in the term dictionary, but a regex that matches, say, abacadaba[\uFFFC\uFFFF]. In this case, the match on abacadaba\uFFFD will fail, it will try to seek to the "next" string, which is abacadaba\uFFFF, but the FFFF will get replaced by FFFD by the byte conversion, and we will loop.

          Mike, I don't think this should be any back-compat concern, unlike the high surrogate case, which I think many CJK applications are probably doing too...

          Uwe Schindler added a comment -

          what is UTF-38? I think you mean UTF-32, if such exists.

          Else it looks good!

          Robert Muir added a comment -

          sorry, my IDE added an @author tag. I need to look to see where to turn this @author generation off for Eclipse.

          Robert Muir added a comment -

          updated patch:

          • don't seek to high surrogates; instead tack on \uDC00. This still works for trunk, but also with the flex branch.
          • don't use a high surrogate prefix; instead truncate. This isn't being used at all; I would rather use the 'constant suffix'.
          • add tests that will break if Lucene's sort order is not UTF-16 (or if automaton is not adjusted to the new sort order)
          • add another enum constructor, where you can specify smart or dumb mode yourself
          • regexp javadoc note
          • add wording to LICENSE, not just NOTICE
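          The \uDC00 tack-on from the first bullet can be shown in isolation. A minimal sketch (the method name is made up, not from the patch): if a computed seek target ends with an unpaired lead surrogate, appending U+DC00, the lowest trail surrogate, produces the smallest well-formed string that sorts at or above it:

```java
final class SeekFixup {
    /** Complete a trailing lone lead surrogate with the lowest trail surrogate. */
    static String fixSeekTarget(String target) {
        if (!target.isEmpty()
                && Character.isHighSurrogate(target.charAt(target.length() - 1))) {
            return target + '\uDC00'; // smallest well-formed string >= target
        }
        return target;
    }
}
```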
          Robert Muir added a comment -

          Yonik, maybe we can use this trick?

          UTF-8 in UTF-16 Order
          The following comparison function for UTF-8 yields the same results as UTF-16 binary
          comparison. In the code, notice that it is necessary to do extra work only once per string,
          not once per byte. That work can consist of simply remapping through a small array; there
          are no extra conditional branches that could slow down the processing.

          int strcmp8like16(unsigned char* a, unsigned char* b) {
            while (true) {
              int ac = *a++;
              int bc = *b++;
              if (ac != bc) return rotate[ac] - rotate[bc];
              if (ac == 0) return 0;
            }
          }

          /* unsigned, so values >= 0x80 compare correctly */
          static unsigned char rotate[256] =
            {0x00, ..., 0x0F,
             0x10, ..., 0x1F,
             ...
             0xD0, ..., 0xDF,
             0xE0, ..., 0xED, 0xF0, 0xF1,
             0xF2, 0xF3, 0xF4, 0xEE, 0xEF, 0xF5, ..., 0xFF};

          The rotate array is formed by taking an array of 256 bytes from 0x00 to 0xFF, and rotating
          0xEE and 0xEF to a position after the bytes 0xF0..0xF4. These rotated values are shown in
          boldface. When this rotation is performed on the initial bytes of UTF-8, it has the effect of
          making code points U+10000..U+10FFFF sort below U+E000..U+FFFF, thus mimicking
          the ordering of UTF-16.
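          A possible Java rendering of the rotation, building the table programmatically instead of writing out all 256 entries. This is a sketch of the Unicode trick quoted above, not Lucene's comparator: identity for most byte values, with the 0xEE/0xEF lead bytes (U+E000..U+FFFF) rotated above the 0xF0..0xF4 lead bytes (U+10000..U+10FFFF):

```java
final class Utf8InUtf16Order {
    private static final int[] RANK = new int[256];
    static {
        for (int i = 0; i < 256; i++) RANK[i] = i;          // identity for most bytes
        for (int b = 0xF0; b <= 0xF4; b++) RANK[b] = b - 2; // supplementary leads slide down
        RANK[0xEE] = 0xF3;                                  // U+E000..U+EFFF leads move up
        RANK[0xEF] = 0xF4;                                  // U+F000..U+FFFF leads move up
    }

    /** Compare UTF-8 byte arrays so the result matches UTF-16 binary order. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = RANK[a[i] & 0xFF] - RANK[b[i] & 0xFF];
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

          For example, U+FFFD (bytes EF BF BD) compares greater than U+10000 (bytes F0 90 80 80) under this comparator, matching UTF-16 binary order, while plain unsigned byte order says the opposite.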

          Yonik Seeley added a comment -

          geeze... maybe we should have just stuck with CESU-8

          Robert Muir added a comment -

          I spent a while with this, thinking I would be slick and create a version of Automaton that works correctly with both trunk and the flex branch.
          Finally, I figured it out: this is not possible.

          There is no bug with the current version, because in trunk, IndexReader.terms() uses UTF-16 binary order.
          In the flex branch, it uses UTF-8 binary order.

          I can emulate UTF-8 binary order in the enum, but then it won't work correctly on trunk, though it will work on the flex branch!

          This enum is sensitive to the order of terms coming in...

          doh!

          Robert Muir added a comment -

          I think I have a workaround for this enum that will not hurt performance either.
          There are two problems: one exists with the existing API, one will become a problem with the flex API if we move to a byte[] TermRef, which, from the performance numbers, it seems we almost certainly should.

          • Problem 1 is the fact that this enum depends upon 'sorted transitions', where each transition is an interval of UTF-16 characters.
            To fix this, two things are required:
          1. splitting intervals that overlap between the BMP and surrogate range into separate intervals
          2. sorting intervals in codepoint order, which means ordering the surrogate range intervals after any BMP intervals.
          • Problem 2 is the fact the enum works on UTF-16 characters again, and we can try to seek in the enumerator to a place ending with a lead surrogate.
          1. In this case, we should just tack on U+DC00 (the lowest of the trail surrogates), which is functionally equivalent.
          2. For the "common prefix" we just remove any trail surrogates, although this common prefix is currently not even used at all, so we should get rid of it; if the first state is not a loop, we are in smart mode anyway!

          I'll fix these problems by providing a new "codepoint-order" comparator for transitions behind the scenes in automaton, along with a getSortedTransitionsCodepointOrder() or something similar to make the whole thing work.

          It might seem at a glance that using 'int' (UTF-32 intervals) instead is a better fix, but this is not true, because it would cause a RunAutomaton to use 1MB of memory where it currently uses 64KB, only for these stupid rare cases.
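          The codepoint-order comparison of UTF-16 code units can be done with a small weight function. The offsets below are one consistent choice for illustration, not necessarily what the patch ended up with: U+E000..U+FFFF slide down below the surrogate block, and surrogates slide up above it, so units compare in the order of the code points they encode:

```java
final class CodePointOrder {
    /** Map a UTF-16 code unit to a weight that sorts in code point order. */
    static int weight(char c) {
        if (c < 0xD800) return c;            // bulk of the BMP: unchanged
        return c >= 0xE000 ? c - 0x800       // U+E000..U+FFFF slide below surrogates
                           : c + 0x2000;     // surrogates (supplementary) go last
    }

    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = weight(a.charAt(i)) - weight(b.charAt(i));
            if (d != 0) return d;
        }
        return a.length() - b.length();
    }
}
```

          Under this comparator "\uFFFD" sorts before U+10000, as UTF-8/UTF-32 order requires, while plain String.compareTo() (UTF-16 code unit order) puts it after.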

          Robert Muir added a comment - - edited

          Mike, just one comment here.

          I am definitely willing to do refactoring here to support this byte[] scheme if necessary, I don't want to give the wrong impression. I think i already have an issue here related to UTF-16 binary order vs UTF-8 binary order that I need to fix, although I think this is just writing a Comparator.

          edit: pretty sure this exists. If someone has, say, both data from the Arabic Presentation Forms block and Chinese text outside the BMP in the index, the "smart" enumerator will unknowingly skip right past the Arabic Presentation Forms block, because it sorts after the lead surrogate in UTF-16 order, but before the entire codepoint in UTF-8/UTF-32 order. I have not experienced this in practice, because I normalize my text so I don't have anything in the Arabic Presentation Forms block. I can fix this, but I would like to see what the approach is for the flex branch, as it's sufficiently complex that I would rather not fix it twice.

          I am just concerned about other similar applications outside of lucene, or some already in lucene core itself!
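The ordering divergence described above can be reproduced with plain JDK calls. The code points here are assumed examples chosen only for illustration: U+FB50 from Arabic Presentation Forms-A, and U+20000, a supplementary CJK ideograph.

```java
public class OrderDivergence {
    public static void main(String[] args) {
        String bmp = "\uFB50";                                // Arabic Presentation Forms-A
        String supp = new String(Character.toChars(0x20000)); // surrogate pair D840 DC00

        // UTF-16 binary order: the lead surrogate U+D840 < U+FB50,
        // so the supplementary character sorts first...
        System.out.println(bmp.compareTo(supp) > 0);                  // true

        // ...but in UTF-8/UTF-32 (codepoint) order 0xFB50 < 0x20000,
        // so the BMP character sorts first.
        System.out.println(bmp.codePointAt(0) < supp.codePointAt(0)); // true
    }
}
```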

          Robert Muir added a comment -

          Michael, the problem is that this code (automaton itself), like much other code, is unaware of supplementary characters.
          It uses a symbolic interval range of 'char' for state transitions.
          But this is ok! When matching an input string with suppl. characters, things work just fine.

          This is one reason why i am concerned about the change to byte[] in flex branch.
          I would have to rewrite this DFA library for this enum to work!

          Michael McCandless added a comment -

          Mike, here is an update to your flex patch. I restored back two tests that disappeared (TestRegexp, etc)

          Woops, thanks! Need svn patch, badly...

          Also, I converted the enum to use char[] as an experiment. i posted the results on LUCENE-2090.
          this is just a hack, it stores the UTF16Result in AutomatonEnum
          I figured i would pass it back just in case you wanted to play more.

          Wow, not "new String()"ing all over gave a sizable gain on the full linear scan query...

          Robert Muir added a comment -

          Mike, here is an update to your flex patch. I restored back two tests that disappeared (TestRegexp, etc)
          Also, I converted the enum to use char[] as an experiment. i posted the results on LUCENE-2090.
          this is just a hack, it stores the UTF16Result in AutomatonEnum
          I figured i would pass it back just in case you wanted to play more.

          Michael McCandless added a comment -

          I opened one for Automaton specifically, should i change it to be all MTQ?

          Oh, sorry no, just automaton.

          nope. because unicode and java are optimized for UTF-16, not UTF-32.

          OK char[] it is!

          Robert Muir added a comment -

          Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]).

          I wanted to make another comment here: I agree that this is somewhat like byte[].
          But there are some major differences:

          1. the java API provides mechanisms in Character, etc for processing text this way.
          2. lots of stuff is unaffected. for example .startsWith() is not broken for supp. characters.
            it does not have to use the codePoint APIs anywhere; it can just compare chars, which may be surrogates, but this is ok.
            so lots of char[]-based processing is already compatible, and completely unaware of this issue. this is not true for byte[]
          3. it will perform the best overall; it's only needed in very few places and we can be very careful where we add these checks, so we don't slow anything down or increase RAM usage, etc.
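A minimal sketch of point 1: the java.lang.Character APIs let you walk a char[] one codepoint at a time, with surrogate pairs advancing by two chars (`countCodePoints` is a hypothetical helper for illustration).

```java
public class CodePointIteration {
    // Walk a char[] one codepoint at a time; surrogate pairs advance by 2.
    static int countCodePoints(char[] text) {
        int count = 0;
        for (int i = 0; i < text.length; count++) {
            i += Character.charCount(Character.codePointAt(text, i));
        }
        return count;
    }

    public static void main(String[] args) {
        // "a" plus U+10400 (a surrogate pair): 3 chars but only 2 codepoints
        char[] text = ("a" + new String(Character.toChars(0x10400))).toCharArray();
        System.out.println(text.length);            // 3
        System.out.println(countCodePoints(text));  // 2
    }
}
```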
          Robert Muir added a comment -

          I thought you were about to open one!

          I opened one for Automaton specifically, should i change it to be all MTQ?

          Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]).

          nope. because unicode and java are optimized for UTF-16, not UTF-32. so we should use char[], but use the codePoint apis, which are designed such that you can process text in UTF-16 (char[]) efficiently, yet also handle the rare case of supp. characters.
          char[] is correct, its just that we have to be careful to use the right apis for processing it.
          With String, a lot of the APIs such as String.toLowerCase do this automatically for you, so most applications have no issues.
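For example, String.toLowerCase handles surrogate pairs transparently via full Unicode case mapping (U+10400/U+10428 are the Deseret capital/small LONG I, chosen here purely as an illustration):

```java
import java.util.Locale;

public class SupplementaryLowerCase {
    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I, stored as a surrogate pair
        String s = new String(Character.toChars(0x10400));
        String lower = s.toLowerCase(Locale.ROOT);
        // case mapping is codepoint-aware: U+10400 -> U+10428
        System.out.println(lower.codePointAt(0) == 0x10428); // true
    }
}
```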

          Michael McCandless added a comment -

          is there a jira issue for this??

          I thought you were about to open one!

          I am just trying to think of ways to make it both faster, at the same time easy too.

          Which is great: keep it up!

          Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]).

          Robert Muir added a comment -

          We can discuss this under the new [separately] "optimization" issue for MTQs?

          is there a jira issue for this??

          Also, remember that the current API is doing not only new String() but also new Term() when it enums the terms, so having to do new String() for MTQs on flex API is OK for starters.

          oh yeah, its clear the flex api is already better, from benchmarks.
          I am just trying to think of ways to make it both faster, at the same time easy too.

          Michael McCandless added a comment -

          Besides TermsEnum.. TermRef is used by the terms dict, when doing the binary search + scan to find a term. And also by TermsConsumer (implemented by the codec, and used when writing a segment to the index).

          Maybe MTQ holds the state, or FilteredTermsEnum? Other consumers of TermsEnum don't need to convert to char[].

          We can discuss this under the new [separately] "optimization" issue for MTQs?

          Also, remember that the current API is doing not only new String() but also new Term() when it enums the terms, so having to do new String() for MTQs on flex API is OK for starters.

          Robert Muir added a comment -

          I agree... though, this requires state (UnicodeUtil.UTF16Result). We could lazily set such state on the TermRef, but, that's making TermRef kinda heavy (it's nice and light weight now). Hmmm.

          i guess the state could be in the TermsEnum, but that doesn't make for general use of TermRef.
          what else uses TermRef?

          Michael McCandless added a comment -

          it would be nice I think if TermRef provided a helper method to make the char[] available?

          I agree... though, this requires state (UnicodeUtil.UTF16Result). We could lazily set such state on the TermRef, but, that's making TermRef kinda heavy (it's nice and light weight now). Hmmm.

          Robert Muir added a comment -

          Oh, that'd be great! It would be faster. I was bummed at how many new String()'s I was doing...

          it would be nice I think if TermRef provided a helper method to make the char[] available?
          i.e. I don't think i should do unicode conversion in a multitermquery?

          Michael McCandless added a comment -

          One question, is it possible to speed this up further, by using UnicodeUtil/char[] conversion from TermRef instead of String?
          Because its trivial to use char[] with the Automaton api (even tho that is not exposed, its no problem)

          I use only string because of the old TermEnum api.

          Oh, that'd be great! It would be faster. I was bummed at how many new String()'s I was doing...

          Robert Muir added a comment -

          Mike, I think your port to TermsEnum is correct, and its definitely faster here.

          One question, is it possible to speed this up further, by using UnicodeUtil/char[] conversion from TermRef instead of String?
          Because its trivial to use char[] with the Automaton api (even tho that is not exposed, its no problem)

          I use only string because of the old TermEnum api.
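The allocation-avoidance idea can be sketched with a reusable java.nio decoder standing in for Lucene's UnicodeUtil/UTF16Result (an assumption for illustration; `ReusableDecode` and `utf8ToUtf16` are hypothetical names): decode each term's UTF-8 bytes into one reused buffer instead of calling new String() per term.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class ReusableDecode {
    // One decoder and one buffer, reused across the whole enumeration
    // instead of allocating a new String per enumerated term.
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private CharBuffer chars = CharBuffer.allocate(64);

    char[] utf8ToUtf16(byte[] bytes, int length) {
        if (chars.capacity() < length) {          // UTF-16 never needs more
            chars = CharBuffer.allocate(length);  // chars than UTF-8 bytes
        }
        decoder.reset();
        chars.clear();
        decoder.decode(ByteBuffer.wrap(bytes, 0, length), chars, true);
        chars.flip();
        char[] result = new char[chars.remaining()];
        chars.get(result);  // copied out only for the demo; a real enum
        return result;      // would hand back the shared buffer + length
    }

    public static void main(String[] args) {
        byte[] term = "dog".getBytes(StandardCharsets.UTF_8);
        System.out.println(new ReusableDecode().utf8ToUtf16(term, term.length)); // dog
    }
}
```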

          Robert Muir added a comment -

          Looks like flex API is faster for the slow queries. Once I fix caching on trunk we should retest...

          Mike, this is cool. I like the results. it appears tentatively, that flex api is faster for both "dumb" (brute force linear reading) and "fast" (lots of seeking) modes.
          at least looking at ????NNN, and *N, which are the worst cases of both here. So it would seem its faster in every case.

          I'll look at what you did to port this to the TermsEnum api!

          Michael McCandless added a comment -

          First cut @ cutting over to flex API attached – note that this
          applies to the flex branch, not trunk!

          I made some small changes to the benchmarker: use constant score
          filter mode, and print the min (not avg) time (less noise).

          Also, I ported the AutomatonTermEnum to the flex API, so this is now a
          better measure ("flex on flex") of what future perf will be. It's
          possible there's a bug here, though TestWildcard passes.

          I still need to investigate why "non-flex on non-flex" and "non-flex
          on flex" perform worse.

          I ran like this:

          java -server -Xmx1g -Xms1g BenchWildcard

          java is 1.6.0_14 64 bit, on OpenSolaris.

          Results (msec is min of 10 runs each):

          Pattern   Trunk (min msec)  Flex (min msec)
          N?N?N?N         13                18
          ?NNNNNN          1                 3
          ??NNNNN          4                 6
          ???NNNN         23                28
          ????NNN        210               170
          NN??NNN          3                 3
          NN?N*            7                 4
          ?NN*            62                30
          *N            4332              2576
          NNNNN??          1                 1

          Looks like flex API is faster for the slow queries. Once I fix caching on trunk we should retest...

          Michael McCandless added a comment -

          OK that warning seems good. Maybe also reference contrib/regex, as another alternative, noting that syntax/capabilities are different?

          Robert Muir added a comment -

          we call this out nicely in the current RegexQuery,
          The expressions supported depend on the regular expression implementation used by way of the RegexCapabilities interface.

          what should I say for the automaton implementation? it already has a javadoc link to the precise syntax supported,
          so in my opinion its actually less ambiguous than contrib RegexQuery.

          but maybe improve this, instead of

          The supported syntax is documented in the {@link RegExp} class.
          

          maybe:

          The supported syntax is documented in the {@link RegExp} class.
          warning: this might not be the syntax you are used to!
          
          Robert Muir added a comment - - edited

          If it's "only" that the syntax is different, that's one thing... but if eg certain functionality isn't possible w/ new or old, that's another.

          at a glance, it appears to me that both the syntax and functionality of our two contrib impls (java.util and jakarta) are very different.

          here is one example: java.util supports reluctant {m,n} closures, jakarta does not; it says this right in the javadocs:
          http://jakarta.apache.org/regexp/apidocs/org/apache/regexp/RE.html

          "Should RE support reluctant {m,n} closures (does anyone care)?"

          But it supports reluctant versus greedy for other operators.

          in automaton, this concept of reluctant versus greedy does not even exist, as spelled out on their page:
          "The * operator is mathematically the Kleene star operator (i.e. we don't have greedy/reluctant/possessive variants)."
          http://www.brics.dk/automaton/faq.html

          this is an example, where all 3 are different... i guess i kinda assumed everyone was aware that all these regex packages are very different.

          Michael McCandless added a comment -

          I don't have any wording – I don't really know the differences

          If it's "only" that the syntax is different, that's one thing... but if eg certain functionality isn't possible w/ new or old, that's another.

          Robert Muir added a comment -

          Would be good to call out what's different in the javadocs somewhere...

          good idea, let me know if you have some suggested wording.
          its really a general issue, the supported features and syntax of even the existing regex implementations in contrib are different I think?
          (i.e. they are not compatible: you cannot just swap impls around, without testing that it supports the syntax and features you are using)

          Michael McCandless added a comment -

          Are we going to deprecate contrib/regex with this?

          I would argue against that, only because the other regex impl's have different features and syntax, even if they are slow.

          Ahh OK, I agree then. I didn't realize this new query doesn't subsume contrib's. Would be good to call out what's different in the javadocs somewhere...

          Robert Muir added a comment -

          Are we going to deprecate contrib/regex with this?

          I would argue against that, only because the other regex impl's have different features and syntax, even if they are slow.

          Michael McCandless added a comment -

          Are we going to deprecate contrib/regex with this?

          Robert Muir added a comment - - edited

          Mark, one last comment. I want to mention this impl is largely unoptimized (from a code perspective; the algorithm itself is better).
          I think you see that from the NNNNN?? being 2.3ms on average versus 1.9, though I am not sure that isn't just a random hiccup.

          So I want to incorporate the logic of some of this benchmark into the tests, so that we can improve the actual code impl. to speed up cases like that.
          While i focus on the scalability, i know a lot of people have small indexes and maybe lots of qps and I don't want to slow them down.

          Some of this is easy, for example we make State.getSortedTransitionArray public, so we don't have to convert from arrays to lists to arrays and such, for no good reason.

          Robert Muir added a comment -

          Mark, yeah lets create a separate issue for fuzzy.
          I found someone implemented that algorithm in python or some other language, we should look/contact them to see what they did.

          currently, I am trying to check this LUCENE-2075 issue, to see if this caching will help cases where the enum must seek a lot, for example the pattern ????NNN
          it is still better than the current wildcard case, but you see it gets a lot worse when you have a lot more seeks.

          i think though, this means i have to cut over to the new FilteredTermsEnum api for the flex branch... which looks interesting btw but this is a complicated enum.

          Mark Miller added a comment -

          You are the man Robert. This is going to be great.

          Next on my wish list is getting the scalable fuzzy done We should start a new issue for that, seeding it with the info you have here. If you don't get to it, I'll be happy to.

          Still on my list to help with review on this patch too. Thanks Uwe as well! Love seeing this stuff make its way into core.

          Robert Muir added a comment -

          Thanks Mike, it is not that impressive really, until you look at regex performance

          The current regexp implementations will scan the entire term dictionary for an expression like "[dl]og?", because there is no 'constant prefix'.
          The idea here, is that lucene should be smart enough to look for do, dog, lo, and log.
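Illustrating which terms "[dl]og?" actually matches, here with java.util.regex purely for demonstration (the patch itself drives the BRICS DFA against the term dictionary rather than testing every term one by one):

```java
import java.util.List;
import java.util.regex.Pattern;

public class FiniteRegexTerms {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("[dl]og?");
        // a toy term dictionary, in sorted order
        List<String> terms = List.of("cat", "do", "dog", "dot", "lo", "log", "zebra");
        for (String t : terms) {
            if (p.matcher(t).matches()) {
                System.out.println(t); // prints: do, dog, lo, log
            }
        }
    }
}
```

A seek-driven enumerator never even visits "cat", "dot", or "zebra"; it jumps straight between the matching regions of the term dictionary.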

          Michael McCandless added a comment -

          Those are impressive gains!

          Robert Muir added a comment -

          Attached is a benchmark which generates random wildcard queries.
          It builds an index of 10 million docs, each with a term from 0-10 million.
          It fills a pattern such as N?N?N?N? with random digits, substituting a random digit for each N.

          Pattern   Iter  AvgHits    AvgMS (old)  AvgMS (new)
          N?N?N?N   10    1000.0     288.6        38.5
          ?NNNNNN   10    10.0       2453.1       6.4
          ??NNNNN   10    100.0      2484.2       10.1
          ???NNNN   10    1000.0     2821.3       47.8
          ????NNN   10    10000.0    2346.9       299.8
          NN??NNN   10    100.0      34.8         6.3
          NN?N*     10    10000.0    26.5         9.4
          ?NN*      10    100000.0   2009.0       73.5
          *N        10    1000000.0  6837.4       6087.9
          NNNNN??   10    100.0      1.9          2.3

          I would like to incorporate part of this logic into the JUnit tests, on maybe a smaller index, because it's how I found the recursion bug.
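          The pattern-filling step described above can be sketched like this (a hypothetical reconstruction; the attached benchmark code is the authoritative version, and the class and method names here are made up):

```java
import java.util.Random;

public class FillPattern {
    // Replace each 'N' in the benchmark pattern with a random digit,
    // leaving the wildcard operators '?' and '*' untouched.
    static String fill(String pattern, Random random) {
        StringBuilder sb = new StringBuilder(pattern.length());
        for (char c : pattern.toCharArray()) {
            sb.append(c == 'N' ? (char) ('0' + random.nextInt(10)) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        // e.g. "N?N?N?N" -> "3?1?9?4" (digits vary per run)
        System.out.println(fill("N?N?N?N", random));
        System.out.println(fill("?NN*", random));
    }
}
```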

          Robert Muir added a comment -

          This patch fixes a bug I introduced when I removed recursion.
          The wildcard tests do not detect it... told you I didn't trust them.

          I will add a test for this, although it was an obvious mistake on my part.

          Robert Muir added a comment -

          I reran my tests; Uwe's fix removes the '5%' problem I mentioned before for leading *.
          Now WildcardQuery is always faster than before (previously it was comparing terms from another field due to the endEnum bug).

          This makes sense, because RunAutomaton.run() is just array access, instead of all the conditional branching in the old wildcardEquals.
          But before, I could not for the life of me figure out why this was slower!
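          For reference, here is a minimal, hand-built stand-in for what a RunAutomaton-style run() does (the real brics RunAutomaton builds its transition table from the parsed expression; this toy table is hard-coded for [dl]og?, i.e. the language {do, dog, lo, log}):

```java
import java.util.Arrays;

public class TinyRunAutomaton {
    static final int REJECT = -1;

    // transitions[state][ch] == next state, or REJECT.
    // Hand-built DFA for [dl]og? :
    // state 0 = start, 1 = after [dl], 2 = after o (accept), 3 = after g (accept).
    static final int[][] transitions = new int[4][128];
    static final boolean[] accept = {false, false, true, true};
    static {
        for (int[] row : transitions) Arrays.fill(row, REJECT);
        transitions[0]['d'] = 1;
        transitions[0]['l'] = 1;
        transitions[1]['o'] = 2;
        transitions[2]['g'] = 3;
    }

    // The hot loop is pure array indexing: no per-character branching on the
    // structure of the pattern, unlike character-by-character wildcard matching.
    static boolean run(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 128 || (state = transitions[state][c]) == REJECT) return false;
        }
        return accept[state];
    }

    public static void main(String[] args) {
        System.out.println(run("dog"));  // true
        System.out.println(run("lo"));   // true
        System.out.println(run("door")); // false
    }
}
```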

          I will now create a better benchmark that generates lots of random numeric wildcards with lots of patterns, and post the results and code.

          Robert Muir added a comment -

          Fix the ø in NOTICE, clean up some unused imports, etc.

          Now that Uwe has fixed a performance bug in the dumb enum (it would never set endEnum=true, but instead false),
          I will dig up my old performance tests and see how we are looking.

          Robert Muir added a comment -

          Uwe, look out for that ø in NOTICE.txt... I will fix it. Thanks for your cleanups!

          Uwe Schindler added a comment -

          Again some updates: moved the '*' and '?' constants also to WildcardQuery and use them in the switch. Also added a better deprecation message to WildcardTermEnum.

          Uwe Schindler added a comment -

          Some cleanups and a more consistent endEnum handling. Also added Javadocs explaining smart and linear mode.

          Michael McCandless added a comment -

          I'm not following this very closely, but, it looks really really cool!

          Robert Muir added a comment -

          Uwe, thank you. This is much nicer!

          I think now it will be easier for some subclass to extend this enum, for example to override difference() or whatever for fuzzy.

          Uwe Schindler added a comment -

          Hi Robert,

          here is my patch. The WildCard and RegExp test queries still pass. I also added a test for the deprecated TermEnum (just a simple MTQ that uses it; it should produce the same results as WildcardQuery).

          The AutomatonTermEnum now switches between smart(R) and non-smart mode using your detection algorithm. termCompare now handles both cases.
          next() just calls super in the linear case (so it behaves like a normal FilteredTermEnum) and uses the smart(R) code in all other cases.

          I will go to bed now, tell me if you like it.

          Robert Muir added a comment - edited

          That would make the enum ugly... But would work.

          This is why i did not do it (i tried and it was ugly), i did not want to make a complicated enum ugly!
          I'll try to think of how this can be done without it being so ugly.

          Edit: by the way Uwe, if you are bored and want to take a stab at this... You know your way around MultiTermQuery better than me.

          Uwe Schindler added a comment -

          See LUCENE-2075 for why it is not so fast (the setEnum calls use seeking, and this is not optimized by the TermCache). Yonik has pointed us to that.

          If the dumb enumeration were included inside AutomatonTermEnum, one could use it without thinking. I would like to move your posted code into AutomatonTermEnum and have two modes, dumb and intelligent. This would need an if switch on each next() call and a delegation to super.next(). That would make the enum ugly... but would work. So just fold the LinearTermEnum into it and make a switch: if (linearMode) return super.next();
          But you have to remove the assert inside endEnum() and change it. In the intelligent case, the endEnum method is never called (because super.next() is never called). So the assert must be: assert linearMode;
          termCompare looks identical in both enums; for the intelligent case the commonPrefix is "".

          Robert Muir added a comment -

          By the way Uwe, I do not particularly like how this AutomatonQuery "decides to use the smart or dumb term enum" in getEnum().
          I wish instead that the AutomatonTermEnum would always be fast, instead of relying on the query to decide.
          I think this would be cleaner, and make subclassing easier.

          But on the other hand, having these two separate makes things easy to understand, as the two methods work in completely different ways.
          I wonder if you have any ideas on this.

          Robert Muir added a comment -

          We should simply add a test for this method and everything else is the WildCardEnum. The good thing about subclassing it is that one has a more performant class if it uses common prefixes and so on than the version we currently have. The wildcardEquals method must stay, but it is not used, so explicitly mark it as "dead code".

          If we do this, there are lots of cases where it will perform better, yes (virtually anything involving the ? operator).
          But if we do this, there are also some cases where it won't perform quite as well: really bad wildcards where it is better to just do a linear scan than to skip around many, many times.
          This is why I have detection for these cases; in getEnum() I instead return the "DumbTermEnum" aka LinearTermEnum in AutomatonQuery.
          If you think this is no problem, we can subclass it anyway. Excerpt below:

              /*
               * If the DFA has a leading kleene star, or something similar, it will
               * need to run against the entire term dictionary. In this case it's much
               * better to do just that than to use fancy enumeration.
               *
               * This heuristic looks for an initial loop, with a range of at least 1/3
               * of the unicode BMP.
               */
              State state = automaton.getInitialState();
              for (Transition transition : state.getTransitions())
                if (transition.getDest() == state
                    && (transition.getMax() - transition.getMin()) > (Character.MAX_VALUE / 3))
                  return new LinearTermEnum(reader);

              return new AutomatonTermEnum(automaton, term, reader);
          Uwe Schindler added a comment - edited

          Uwe, i looked at the WildcardTermEnum and it was easy to make it a subclass, with no logic, just a ctor.

          That was my idea!

          This is where all the logic really is anyway.

          We should simply add a test for this method and everything else is the WildCardEnum. The good thing about subclassing it is that one has a more performant class if it uses common prefixes and so on than the version we currently have. The wildcardEquals method must stay, but it is not used, so explicitly mark it as "dead code".

          The good thing: the method is final (this is what I see from your fragment) - so nobody was able to override it to change the behaviour of the enum, so nothing can break.

          I would go this way.

          Robert Muir added a comment -

          Uwe, I looked at the WildcardTermEnum and it was easy to make it a subclass, with no logic, just a ctor.

            public WildcardTermEnum(IndexReader reader, Term term) throws IOException {
              super(WildcardQuery.toAutomaton(term), term, reader);
            }
          

          The problem is that this hardly removes any duplicated code, because we must keep this available:

           public static final boolean wildcardEquals(String pattern, int patternIdx,
              String string, int stringIdx)
            {
          

          This is where all the logic really is anyway.
          So I think I would rather leave this one alone? But I will add a test for it, to ensure it doesn't break since we are not using it.
          What do you think?

          Robert Muir added a comment -

          Uwe, both your ideas are great. Thank you for looking.
          I will take a stab at those.

          Robert Muir added a comment -

          As far as testing, one of the simple things we can try is generating random wildcard strings against a large corpus and auto-comparing the results of the old and the new.

          An idea I have here is the ORP-2 corpus: it has approx 417K unique terms and 160K docs,
          and it's open, so anyone could participate.

          Uwe Schindler added a comment -

          I like it, too, some thoughts:

          • Maybe make AutomatonTermEnum public instead of package-private (it may be extended and of use for subclasses, like a future FuzzyQuery returning a difference())
          • The code in WildcardTermEnum is deprecated but still there, and therefore duplicated functionality. Maybe we could make this class a subclass of AutomatonTermEnum, but initialized to be a simple wildcard. The TermEnum no longer has a test (the deprecated one), so we should maybe add a test.
          Robert Muir added a comment -

          Mark, thanks, let me know if you have the chance to look more thoroughly.

          I agree, let's consider some ideas for testing wildcards; yours sounds good.
          One problem I had is trying to figure out: "what is the average/common case" for wildcards/regex?
          It's also important when considering some additional optimizations I haven't yet implemented.

          Also, I think I might have some additional wildcard tests from the contrib patch.
          I left TestWildcard completely as-is for now though, because I thought it would be nice to show it passes unchanged.

          Mark Miller added a comment -

          Nice! The resulting jar is still just 1.0 MB. Looks great on a quick look through. I'll go over it more thoroughly when I get a chance.

          As far as testing, one of the simple things we can try is generating random wildcard strings against a large corpus and auto comparing the results of the old and the new.

          +1 on the automaton name for the util package.

          I'd almost still prefer RegexQuery - it's contrib vs core and different packages - I hate to lose out on the better name. Though that's a bit subjective.

          Robert Muir added a comment -

          Mark, I think this patch is ok, all tests pass etc.
          Can you take a look and let me know your thoughts?

          Robert Muir added a comment -

          Okay - still not an issue I don't think - leading wildcards are already an issue - 5% is worth the other speedups I think

          Mark, the old WildcardTermEnum is public, so we must keep it around for a while anyway.
          I can use it for this case, so we don't lose this 5% in the special case.

          Might be worth deprecating this old WildcardTermEnum still, though, just because it's code to be maintained, hardly used except for this purpose.

          Robert Muir added a comment -

          Mark, OK. In that case I will not include these, and rename the pkg as you suggest.
          These default named automata are not enabled by default in the library anyway if you use the RegExp() default constructor.
          The (renamed and pared) API is still extensible: if you want to create named automata to use in your regular expressions, you just implement the very simple interface DatatypesAutomatonProvider,
          then pass this to the constructor of RegexpQuery.

          Mark Miller added a comment -

          On the one hand I'd say, well, let's not rename the package then - it's not that important. But these things could get out of sync anyway, so I'm not sure it's worth it to try and maintain some sort of compatibility. If these features are useful enough, we could end up adding them later. Your call though. Personally, I'd think we start just by adding the essentials and build from there as makes sense.

          Robert Muir added a comment -

          OK, we have the start of a plan; only one final nit I am worried about.

          I pared away the 'built-in named automata':

          • example <Lu> (uppercase letter, from Unicode)
          • example <QName> (from XML)

          If we keep the original pkg name, a user can have these just by adding brics.jar to their classpath.
          They would just pass new DatatypesAutomatonProvider() to the constructor of RegexpQuery, done.

          If we rename the pkg, this will not work, because the DatatypesAutomatonProvider from the jar file implements dk.brics.automaton.AutomatonProvider,
          not o.a.l.util.brics.AutomatonProvider.

          Alternatively, we could rename the pkg but restore perhaps a subset of these datatypes, maybe without all the XML ones, just the basic Unicode categories and such?
          This would cost a little space, though. Here is the list:
          http://www.brics.dk/automaton/doc/dk/brics/automaton/Datatypes.html#get%28java.lang.String%29

          I ask this question because personally I don't use any of these built-ins, but users might want them? What do you think?

          Mark Miller added a comment -

          We could just remove the .rewrite(). it is only that very special case, for leading *, where the existing WildcardQuery logic is slightly faster (< 5%).

          Agreed - not worth the extra code for speeding up such a horrible case by 5%.

          Any other ideas?

          I'd rather change the contrib names and let core have the good name. We can start with RegexpQuery, I think.

          o.a.l.util.brics?

          That's my best thought at the moment.

          Robert Muir added a comment -

          Yes - I think so - but how to handle the fact that you fall back to it? We might either rename it or incorporate it into the new WildcardQuery?

          We could just remove the .rewrite(); it is only that very special case, for a leading *, where the existing WildcardQuery logic is slightly faster (< 5%).
          I was actually surprised the WildcardQuery logic beats a DFA; I guess something is to be said for that hairy logic.

          Shouldn't this one eventually make the old obsolete? I say we name it RegexQuery.

          I do not know; all regex is not created equal. This one has different syntax and stuff from the other impls.
          Any other ideas? Obviously the name RegexpQuery, with a p, is available.

          Yup - I think we should reformat and drop the author tags. We can mention that type of info in the NOTICE file.

          OK, this is easy; I already have NOTICE in the patch. I was sure all files from brics have their license header also.

          I think so - perhaps util.brics? No need for dk I don't think.

          o.a.l.util.brics?

          Mark Miller added a comment -

          I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?

          Yes - I think so - but how to handle the fact that you fall back to it? We might either rename it or incorporate it into the new WildcardQuery?

          but then what should AutomatonRegexQuery be called, we already have RegexQuery?

          Shouldn't this one eventually make the old obsolete? I say we name it RegexQuery.

          Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).

          Yup - I think we should reformat and drop the author tags. We can mention that type of info in the NOTICE file.

          should we rename the pkg?

          I think so - perhaps util.brics? No need for dk I don't think.

          sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?

          One large patch is fine with me - my IDE will make short work of grokking it.

          Robert Muir added a comment - - edited

          attached is an alternate patch with no library dependency (LUCENE-1606_nodep.patch)
          instead it imports 'pared-down' automaton source code (compiles to 48KB jar)
          it is still setup in contrib regex because...

          Mark: some practical questions, I'd like to create a patch that integrates it nicely into core, just so we can see what it would look like.
          Thoughts on class names and pkg names?

          • I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?
          • but then what should AutomatonRegexQuery be called, we already have RegexQuery

          Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).
          should we rename the pkg?

          sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?
          also, i did not remove any tests, for example, TestWildcardQuery already exists, so the test here is just duplication, i just might add a test or 2 to the existing TestWildcardQuery

          Robert Muir added a comment -

          Mark, ok. I will supply a new patch with no lib dependency, instead it includes the pared Automaton code in one pkg.
          this compiles to about a 48KB jar right now. Reducing it more would involve sacrificing readability or useful stuff.

          Example: keeping the "Matcher" would be useful if you want to use this for a really fast 'PatternTokenizer', but not needed here.

          Mark Miller added a comment -

          Point taken - the tests are not perfect. They never are. But it doesn't stop us chugging along. We can always write more tests, and trunk tends to get quite a work out if you put changes in towards the beginning of a dev cycle. Bugs are inevitable, in trunk or contrib. But I don't think the wildcard impl will get much exposure in contrib anyway - its not wired into the queryparser, and it won't come with a sign saying check this out. Users will still use the standard wildcardquery - and I want to see it improved. We can work out the patch, work out the tests, and then decide it's not good enough - or perhaps another committer will look and decide that. I'd still love to put it in core myself.

          Robert Muir added a comment -

          How do you want to improve them?

          well for one, i test all the rewrite methods and boosts here.
          ok these are also now fixed as of 3.0 in core wildcard tests also (LUCENE-1951), but those were two 'buglets', just an example.

          Mark Miller added a comment -

          i don't really have any, except that I don't necessarily trust the current wildcard tests. Shouldn't they have detected 2.9.0 scorer bug?

          If they caught that, they wouldn't catch another. How do you want to improve them? I'll help test and write tests - we can make something much more intensive if you'd like, and then just put a flag to tone it down for normal test running.

          ok, i will work at this some more. obviously i could pare it down to just what we are using, but i am trying to preserve 'reasonable' functionality that might be handy elsewhere.

          Right - don't go further than makes sense - even 53k -> 20k - I don't think it really matters that much. So really I meant, as small as makes reasonable sense.

          Robert Muir added a comment -

          What are your concerns? If it passes the current wildcard tests and survives in trunk for a dev cycle, isn't that likely enough?

          i don't really have any, except that I don't necessarily trust the current wildcard tests. Shouldn't they have detected 2.9.0 scorer bug?

          As small as possible

          ok, i will work at this some more. obviously i could pare it down to just what we are using, but i am trying to preserve 'reasonable' functionality that might be handy elsewhere.

          Mark Miller added a comment -

          I think trying this out around in contrib (after 3.0 is released) would be best in the short term?

          What are your concerns? If it passes the current wildcard tests and survives in trunk for a dev cycle, isn't that likely enough?

          Do you have a target size I should shoot for?

          As small as possible. But I don't personally have any issue adding 53k to the core jar for this goodness. Guess we will have to see what others say - but its a low percentage of the current 1.1 MB, and pretty sweet functionality.

          Robert Muir added a comment -

          So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

          I think trying this out around in contrib (after 3.0 is released) would be best in the short term?

          Separately, my quickly 'pared' automaton library is now 53KB jar (14 java files, some just simple POJO)
          Do you have a target size I should shoot for?

          Robert Muir added a comment -

          That is a cool tradeoff to be able to make.

          Mark, yes. I guess someone could implement the DFA-reversing if they wanted to, to enable leading .* regex support with ReverseStringFilter.
          you can still use this Wildcard impl with ReverseStringFilter just like the core Wildcard impl, because its just so easy to reverse a wildcard string.

          but you don't want to try to reverse a regular expression! that would be hairy. easier to reverse a DFA.

          but even without this, there are tons of workarounds, like the tradeoff i mentioned earlier.
          also, another one that might not be apparent is that its only the leading .* that is a problem, depending on corpus of course.

          [a-z].*abacadaba will avoid visiting terms that start with 1,2,3 or are in chinese, etc, which might be a nice improvement.
          of course if all your terms start with a-z, then its gonna be the same as entering .*abacadaba, and be bad.

          all depends on how selective the regular expression is wrt your terms.
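          The selectivity argument above can be sketched concretely. The following is an illustrative Python toy, not Lucene's enumerator: the term list, the `matching_terms` helper, and the first-character test are all stand-ins, and the real implementation derives its "dead state" information from the DFA rather than from a hardcoded character class. The point it demonstrates is the same, though: once a term's prefix is known to be rejected, the enumerator seeks past the whole block of terms sharing that prefix instead of testing each one.

```python
import bisect
import re

def matching_terms(sorted_terms, pattern, first_char_ok):
    """Toy sketch of DFA-guided term enumeration over a sorted dictionary."""
    matcher = re.compile(pattern)
    hits, i = [], 0
    while i < len(sorted_terms):
        term = sorted_terms[i]
        if term and not first_char_ok(term[0]):
            # First character already rejected: seek past every term that
            # shares it, instead of running the matcher on each of them.
            i = bisect.bisect_right(sorted_terms, term[0] + "\uffff", i)
            continue
        if matcher.fullmatch(term):
            hits.append(term)
        i += 1
    return hits

terms = sorted(["1abacadaba", "3foo", "babacadaba", "zzzabacadaba", "中文"])
hits = matching_terms(terms, r"[a-z].*abacadaba", lambda c: "a" <= c <= "z")
# hits == ["babacadaba", "zzzabacadaba"]; the terms starting with digits or
# Chinese characters were skipped without ever reaching the matcher.
```

          As in the comment above: if every term in the dictionary starts with a-z, the skip never fires and this degrades to a full scan, exactly the .*abacadaba worst case.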

          Mark Miller added a comment -

          That is a cool tradeoff to be able to make.

          Robert Muir added a comment - - edited

          Does it make sense to implement here though?

          I do not think so. I tested another solution where users wanted leading * wildcards on 100M+ term dictionary.
          I found out what was acceptable (clarification: to these specific users/system) was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this.
          This performed very well, because it can still avoid comparing many terms.
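          A minimal sketch of that rewrite, using Python's re module purely for illustration (the `wildcard_to_bounded_regex` helper and the bound of 3 are assumptions matching the description above, not code from the patch):

```python
import re

def wildcard_to_bounded_regex(wildcard, max_star=3):
    """Rewrite a wildcard pattern so '*' matches at most max_star characters.

    This bounds the leading-'*' case so an enumerator never has to visit
    the entire term dictionary.
    """
    parts = []
    for ch in wildcard:
        if ch == "*":
            parts.append(".{0,%d}" % max_star)
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

pattern = re.compile(wildcard_to_bounded_regex("*abacadaba"))
# A 3-character prefix still matches; a 4-character prefix is rejected,
# which is the tradeoff those users accepted for speed.
```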

          Mark Miller added a comment -

          Okay - still not an issue I don't think - leading wildcards are already an issue - 5% is worth the other speedups I think - though you've taken care of that anyway - so sounds like gold to me. I didn't expect this to solve leading wildcard issues, so no loss to me.

          No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries

          bummer Does it make sense to implement here though? Isn't the ReverseStringFilter enough if a user wants to go this route? Solr's support for this is fairly good, but I don't think it needs to be as 'built in' for Lucene?

          Robert Muir added a comment - - edited

          No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ that with this?

          I don't think there is any issue. by implementing WildcardQuery with the DFA, leading ? is no longer a problem,
          i mean depending on your term dictionary if you do something stupid like ???????abacadaba it probably wont be that fast.

          I spent a lot of time with the worst-case regex, wildcards to ensure performance is at least as good as the other alternatives.
          There is only one exception, the leading * wildcard is a bit slower with a DFA than if you ran it with actual WildcardQuery (less than 5% in my tests)
          Because of this, currently this patch rewrites this very special case to a standard WildcardQuery.

          Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

          No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries

          Mark Miller added a comment - - edited

          By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times.
          So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow
          in the same manner, regex of .abacadaba is fast, regex of .*abacadaba is still slow.

          No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ that with this?

          but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *
          I didnt implement this here though yet.

          Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

          Robert Muir added a comment -

          By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times.
          So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow
          in the same manner, regex of .abacadaba is fast, regex of .*abacadaba is still slow.

          but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *
          I didnt implement this here though yet.

          Robert Muir added a comment -

          You should also post the info about that Fuzzy possibility you were mentioning - perhaps a math head will come along and take care of that for us with the proper setup.

          Right, i created a FuzzyQuery that builds in the 'naive' method. The problem is that for large strings this exponential-time naive mechanism creates a rather large NFA, and the NFA->DFA conversion is very slow.
          Once the DFA is built, actually running it on a term dictionary is fast
          So the slow part has nothing to do with lucene at all.

          So we just need to build these DFAs in an efficient way:
          We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W
          http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
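          One way to see what such an automaton computes, as a sketch in Python (this is the plain edit-distance DP, not the linear-time construction from the cited paper, and `levenshtein_within` is a hypothetical helper name): each DP row is exactly the determinized set of live Levenshtein-NFA states after consuming a prefix of the candidate, and a row whose minimum exceeds n corresponds to a DFA reject state. A real Levenshtein automaton precompiles these row-to-row transitions once per query word, so matching a term needs no per-term DP.

```python
def levenshtein_within(word, candidate, n):
    """True if edit_distance(word, candidate) <= n.

    prev/curr are DP rows; row i after consuming candidate[:k] holds the
    fewest edits needed to turn word[:i] into candidate[:k].
    """
    prev = list(range(len(word) + 1))
    for ch in candidate:
        curr = [prev[0] + 1]
        for i, wc in enumerate(word, 1):
            curr.append(min(prev[i] + 1,                # deletion
                            curr[i - 1] + 1,            # insertion
                            prev[i - 1] + (wc != ch)))  # match/substitution
        if min(curr) > n:
            return False  # every NFA state is dead: a reject state
        prev = curr
    return prev[-1] <= n
```

          The early exit is the same pruning the enumerator exploits: once the automaton enters a reject state, no extension of the current prefix can ever match.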

          Robert Muir added a comment -

          So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

          Mark, some notes on size. jarring up the full source code (no paring) is 81KB.
          in practice, the jar file is larger because it contains some 'precompiled DFAs' for certain things like Unicode blocks, XML types... are these really needed?

          see here for a list of what I mean: http://www.brics.dk/automaton/doc/dk/brics/automaton/Datatypes.html
          I enabled these in the patch (they could be easily disabled): an example of how they are used in a regexp is like this: <Arabic>* (match 0 or more arabic characters)

          if a user really wanted them, they can load them themselves, you can also create custom ones and use a DataTypesAutomatonProvider to register them for some name:
          Example: your users want to be able to use <make> or <model> inside their regexps, you can register <make> and <model> to match to some DFA you make yourself.
          its really a nice mechanism, but I don't think we need all the precompiled ones?

          Mark Miller added a comment - - edited

          So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

          You should also post the info about that Fuzzy possibility you were mentioning - perhaps a math head will come along and take care of that for us with the proper setup.

          Uwe Schindler added a comment -

          3.0 is just the switch to 1.5 and generics. So this is a typical java 1.5 issue and can go into 3.0 even if it is a new feature. Contrib is not core and may have its own rules.

          In my opinion, this would be a nice addition to the regex contrib and should also have been in 2.9, but the underlying library is Java 5 only, so we had to wait until 3.0.

          Robert Muir added a comment -

          Grant, I thought it was ok from Uwe's comment:

          I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1).

          I guess now I am a little confused about what should happen for 3.0 with contrib in general?
          No problem moving this to 3.1, let me know!

          Grant Ingersoll added a comment -

          Why are new features going into 3.0? I was under the impression that 3.0 was just supposed to be cleanup plus Java 1.5

          Uwe Schindler added a comment -

          No prob! I will help you, I am on heavy committing

          Robert Muir added a comment -

          if no one objects, i'd like to commit this in a few days. Can someone help out and commit the update to NOTICE?

          Robert Muir added a comment -

          if anyone can spare a sec to take a glance/review before 3.0, i think its ok...

          Robert Muir added a comment -

          updated patch to trunk:

          • add support for optional regex features
          • remove recursion
          • improve performance for worst-case regexp/wildcard/FSM
          • improved docs & test
          • remove the fuzzy impl, NFA->DFA too slow for this, maybe a later addition.
          Robert Muir added a comment -

          Uwe, sure. I will bring this patch up to speed (java 5, etc)

          Uwe Schindler added a comment -

          Robert: Do you want to take this again? It's yours and contrib

          Otis Gospodnetic added a comment -

          Regarding the license - I think we already have BRICS in one of Nutch's plugins, so we should be OK with the BSD licensed jar in our repo.

          ./urlfilter-automaton/lib/automaton.jar

          Robert Muir added a comment -

          Uwe, ok.

          Not to try to complicate things, but related to LUCENE-1689 and java 1.5, I could easily modify the Wildcard functionality here to work correctly with suppl. characters

          This could be an alternative to fixing the WildcardQuery ? operator in core.
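          The supplementary-character problem behind LUCENE-1689 is easy to demonstrate outside Java (an illustration only; Python strings index by code point, so the UTF-16 view Java's char-based matching sees has to be simulated explicitly):

```python
# A character outside the BMP is one code point but two UTF-16 code units
# (a surrogate pair), so a '?' matcher that steps through code units
# counts it as two "characters" instead of one.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

code_points = len(clef)                           # what '?' should match: 1
utf16_units = len(clef.encode("utf-16-le")) // 2  # what char stepping sees: 2
```

          Matching per code point rather than per code unit is exactly what makes the DFA-based wildcard handle suppl. characters correctly.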

          Uwe Schindler added a comment -

          I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1).

          Robert: you could supply a patch with StringBuilder toString() variants and all those @Override uncommented-in. And it works correct with 1.5 (I am working with 1.5 here locally - I hate 1.6...).

          Robert Muir added a comment -

          Uwe, sorry about this.

          I did just verify automaton.jar can be compiled for Java 5 (at least it does not have java 1.6 dependencies), so perhaps this can be integrated for a later release.

          Uwe Schindler added a comment -

          So I tend to move this to 3.0 or 3.1, because of missing support in regex contrib.

          Uwe Schindler added a comment -

          Doesn't seem to work, I will check the sources:

          compile-core:
              [javac] Compiling 12 source files to C:\Projects\lucene\trunk\build\contrib\regex\classes\java
              [javac] C:\Projects\lucene\trunk\contrib\regex\src\java\org\apache\lucene\search\regex\AutomatonFuzzyQuery.java:11: cannot access dk.brics.automaton.Automaton
              [javac] bad class file: C:\Projects\lucene\trunk\contrib\regex\lib\automaton
          .jar(dk/brics/automaton/Automaton.class)
              [javac] class file has wrong version 49.0, should be 48.0
              [javac] Please remove or make sure it appears in the correct subdirectory of
           the classpath.
              [javac] import dk.brics.automaton.Automaton;
              [javac]                           ^
              [javac] 1 error
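          For context on the error above: the version numbers come from the class-file header, where major version 48 corresponds to Java 1.4 and 49 to Java 5. A small self-contained sketch (not part of the patch) that reads the major version from class-file bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ClassFileVersion {
    // Returns the class-file major version (48 = Java 1.4, 49 = Java 5, 50 = Java 6).
    static int majorVersion(byte[] classBytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(classBytes));
        if (in.readInt() != 0xCAFEBABE) {   // every class file starts with this magic
            throw new IOException("not a class file");
        }
        in.readUnsignedShort();             // minor version
        return in.readUnsignedShort();      // major version
    }

    public static void main(String[] args) throws IOException {
        // Header bytes of a class compiled with -target 1.4: magic, minor 0, major 48.
        byte[] header = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 48 };
        System.out.println(majorVersion(header)); // prints 48
    }
}
```

          So a jar built with plain Java 1.6 defaults produces version 49.0+ classes, which a 1.4-targeted javac classpath rejects exactly as shown.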
          
          Robert Muir added a comment -

          Uwe, you are correct, I just took a glance at the automaton source code and saw StringBuilder, so I think it is safe to say it requires at least Java 1.5...

          Uwe Schindler added a comment -

          Robert: I applied the patch locally, one test was still using @Override, fixed that. I did only download automaton.jar not the source package.

          Do you know if automaton.jar is compiled using -source 1.4 -target 1.4? (It was compiled using ant 1.7 and Java 1.6.) If not sure, I will try to build it again from source and use the correct compiler switches. The regex contrib module is Java 1.4 until now. If automaton only works with 1.5, we should wait until 3.0 to release it.

          Mark Miller added a comment -

          I don't think there is a problem with BSD. I know Grant has committed a BSD licensed stop word list in the past.

          I've asked explicitly about it before, but got no response.

          I'll try and dig a little, but Grant is the PMC head and he did it, so we wouldn't be in bad company...

          Uwe Schindler added a comment -

          I'll take it; I think it is almost finished. The only open question at the moment is bundling the external library in contrib, which is BSD-licensed. Are there any problems with that?

          If not, I can manage the inclusion into the regex contrib.

          Mark Miller added a comment -

          This is a cool issue, but it hasn't found an assignee yet. We may have to push it to 3.1.

          Any interest Uwe?

          Robert Muir added a comment -

          removed use of multitermquery's getTerm()

          equals/hashcode are defined based upon the field and the language accepted by the FSM, i.e. regex query of AB.*C equals() wildcard query of AB*C because they are the same.

          Robert Muir added a comment -

          eks, in case this makes for a better explanation of your example:
          assume a huge term dictionary where words start with a-zA-Z, for simplicity.

          for each character X in that alphabet it will look for 'Xana' and 'Xna' in the worst case.
          that's 110 comparisons to check all the words that don't start with 'a'.
          (the enumeration through all the words that start with 'a' is a little more complex).

          if you have say, 1M unique terms you can see how doing something like 100-200 comparisons is a lot better than 1M.

          Robert Muir added a comment -

          eks, in your example it does three comparisons instead of four (not much of a gain for this example, but a big gain on a real index)

          this is because it doesn't need to compare 'two': after encountering 'three' it requests TermEnum("uana"), which returns null.

          I hope you can see how this helps for a large index... (or I can try to construct a more realistic example)

          Robert Muir added a comment -

          eks, well it does work well for fuzzy n=1 (I have tested against my huge
          index).

          for your simple dictionary it will do 3 comparisons instead of 4.
          this is because your simple dictionary is sorted in the index as such:
          four
          one
          three
          two

          when it encounters 'three' it will next ask for a TermEnum("una") which will
          return null.

          give it a try on a big dictionary, you might be surprised



          Eks Dev added a comment -

          hmmm, sounds like a good idea, but I am still not convinced it would work for Fuzzy

          take simple dictionary:
          one
          two
          three
          four

          the query Term is, e.g. "ana", right? and n=1 means your DFA would be:

          {.na, a.a, an., an, na, ana, .ana, ana., a.na, an.a, ana.}

          where dot represents any character in your alphabet.

          For the first element in the DFA (in expanded form) you need to visit all terms, no matter how you walk the DFA... or am I missing something?

          Where you could save time is the actual calculation of the LD matrix for terms that do not pass the automaton

          Robert Muir added a comment -

          eks:

          the AutomatonTermEnumerator in this patch does walk the term dictionary according to the transitions present in the DFA. That's what this JIRA issue is all about to me, not iterating all the terms! So you do not need the complete dictionary as a DFA.

          for example: a regexp query of (a|b)cdefg with this patch seeks to 'acdefg', then 'bcdefg', as opposed to the current regex support which exhaustively enumerates all terms.

          slightly more complex example: a query of (a|b)cd*efg first seeks to 'acd' (because of the Kleene star operator). Suppose it then encounters term 'acda'; it will next seek to 'acdd', etc. If it encounters 'acdf', then next it seeks to 'bcd'.

          this patch implements regex, wildcard, and fuzzy with n=1 in terms of this enumeration. What it doesn't do is fuzzy with arbitrary n.

          I used the simplistic quadratic method to compute a DFA for fuzzy with n=1 for the FuzzyAutomatonQuery present in this patch; the paper has a more complicated but linear method to compute the DFA.
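          The seek-driven enumeration described above can be sketched with an ordinary sorted set standing in for the term dictionary. Everything below (class name, method name, data) is illustrative, not code from the patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SeekSketch {
    // Toy seek-driven matching: for each string the automaton could accept
    // next, jump with ceiling() (one seek, like repositioning a TermEnum)
    // instead of scanning every term in the sorted dictionary.
    static List<String> seekMatches(TreeSet<String> terms, String[] candidates) {
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            String term = terms.ceiling(candidate); // smallest term >= candidate
            if (candidate.equals(term)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
            List.of("acdefg", "apple", "bcdefg", "zebra"));
        // For the regexp (a|b)cdefg only two strings can match, so two seeks
        // replace an exhaustive scan over all four terms.
        System.out.println(seekMatches(terms, new String[] { "acdefg", "bcdefg" }));
        // prints [acdefg, bcdefg]
    }
}
```

          On a dictionary of millions of terms the gap between a handful of seeks and an exhaustive enumeration is what makes the reported speedups plausible.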

          Eks Dev added a comment -

          Robert,
          in order for Lev. Automata to work, you need to have the complete dictionary as a DFA. Once you have the dictionary as a DFA (or any sort of trie), computing simple regexes or simple fixed or weighted Levenshtein distance becomes a snap. Levenshtein-Automata is particularly fast at it; a much simpler and only slightly slower method (one page of code) is K. Oflazer's (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.3862).

          As said, you cannot really walk the current term dictionary as an automaton/trie (or do you have an idea on how to do that?). I guess there are enough applications where storing the complete Term dictionary in a RAM DFA is not a problem. Even making some smart (heavily cached) persistent trie/DFA should not be all that complex.

          Or did you intend just to iterate all terms and compute the distance faster ("break LD matrix computation as soon as you see you hit the boundary")? But this requires iteration over all terms?

          I have done something similar, in memory, but unfortunately someone else paid me for this and is not willing to share...

          Robert Muir added a comment -

          found this interesting article applicable to this query: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

          "We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W."

          Robert Muir added a comment -

          this includes an alternative for another slow linear query, fuzzy query.

          automatonfuzzyquery creates a DFA that accepts all strings within an edit distance of 1.

          on my 100M term index this works pretty well:
          fuzzy: 251,219 ms
          automatonfuzzy: 172 ms

          while it's true it's limited to an edit distance of one, on the other hand it supports transposition and is fast.
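          The language such an edit-distance-1 DFA accepts can be cross-checked against plain Levenshtein distance. This brute-force sketch only illustrates the accepted language (the patch's DFA additionally counts a transposition as one edit); it is not the DFA construction itself:

```java
public class EditDistanceSketch {
    // Classic dynamic-programming Levenshtein distance. A fuzzy DFA for term t
    // with n=1 accepts exactly the strings s with distance(t, s) <= 1
    // (plus transpositions, in the patch's variant).
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // delete
                                            d[i][j - 1] + 1),     // insert
                                   d[i - 1][j - 1] + cost);       // substitute
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("ana", "ama") <= 1); // true: one substitution
        System.out.println(distance("ana", "xyz") <= 1); // false: three edits away
    }
}
```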

          Robert Muir added a comment -

          updated with smarter enumeration. I think this is mathematically the best you can get with a DFA.

          for example if the regexp is (a|b)cdefg it knows to position at acdefg, then bcdefg, etc
          if the regexp is (a|b)cd*efg it can only position at acd, etc.

          nextString() is now CPU-friendly: it walks the state transition character intervals in sorted order instead of brute-forcing characters.
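          The interval-walking idea can be sketched as follows; the method name and transition data are made up for illustration, assuming a state's outgoing transitions are kept as sorted [min,max] character intervals:

```java
public class IntervalSketch {
    // Given the sorted [min,max] intervals of a state's outgoing transitions,
    // return the smallest transition label >= c by walking the intervals,
    // rather than probing every character of the alphabet in turn.
    static int nextLabel(char c, char[][] sortedIntervals) {
        for (char[] iv : sortedIntervals) {
            if (c <= iv[1]) {                 // interval not entirely below c
                return Math.max(c, iv[0]);    // clamp up to the interval start
            }
        }
        return -1; // no transition with a label >= c
    }

    public static void main(String[] args) {
        char[][] transitions = { { 'a', 'c' }, { 'f', 'h' }, { 'x', 'z' } };
        System.out.println((char) nextLabel('d', transitions)); // prints f
        System.out.println(nextLabel('~', transitions));        // prints -1
    }
}
```

          With large alphabets such as CJK, the number of intervals per state stays small even though the alphabet has tens of thousands of characters, which is why this beats brute-forcing characters.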

          Mark Miller added a comment -

          If you think it's ok, I could set constant score rewrite=true in this case.

          I agree that it should just be left up to the user. It's probably not a good idea to change the scoring for what could appear to the user to be arbitrary queries.

          Mark Miller added a comment -

          When refactoring MultiTermQuery I tried just computing the bit set iterator on the fly. It did not appear to work out, but I wonder if there are cases where it would be a better option.

          For example, if the language of the automaton is infinite (for example, built from a regular expression/wildcard with a * operator), it seems best to set constant score rewrite=true.

          Okay, that starts to make more sense then. I think the reports that it was faster on some large indexes were based on wildcard queries (hard to remember 100%).

          Robert Muir added a comment -

          Uwe, ok, based on your tests I tried some of my own... on my index, when the query matches less than 10-20% of the docs, the Query method is faster.

          when it matches something like over 20%, the Filter method starts to win.

          Uwe Schindler added a comment - - edited

          I didn't do any of this because I wasn't sure if this constant score rewrite option is something that should be entirely left to the user, or not.

          Yes, it should normally be left to the user. And the slower filter on large indexes with only sparsely filled bitsets is related to LUCENE-1536.

          E.g. I did some comparisons for TrieRangeQuery on a 5 million doc index, integer field, 8 bit precision step (so about 400 terms per query); the filter is about twice as fast. But the ranges were random and hit about 1/3 of all documents on average per query, so the bitset is not so sparse.
          TrieRangeQuery is a typical example of a MultiTermQuery, that also works well with Boolean rewrite, because the upper term count is limited by the precision step (for ints and 8 bit the theoretical, but never reached, maximum is about 1700 terms, for lower precisionSteps even less).

          Robert Muir added a comment -

          yes, I just verified and can easily and quickly detect if the FSM can accept more than BooleanQuery.getMaxClauseCount() Strings.

          !Automaton.isFinite() || Automaton.getFiniteStrings(BooleanQuery.getMaxClauseCount()) == null

          If you think it's ok, I could set constant score rewrite=true in this case.

          Robert Muir added a comment -

          Uwe: yes I tried to think of some heuristics for this query to guess which would be the best method.

          For example, if the language of the automaton is infinite (for example, built from a regular expression/wildcard with a * operator), it seems best to set constant score rewrite=true.

          I didn't do any of this because I wasn't sure if this constant score rewrite option is something that should be entirely left to the user, or not.

          Uwe Schindler added a comment -

          For this functionality I still like the constant score rewrite option because there is no risk of hitting the boolean clause limit.

          I thought about that, too. Maybe there will be a possibility to do an auto-switch in MultiTermQuery. If a TooManyBooleanClauses exception is caught during the rewrite() method, it could fall back to returning the ConstantScore variant. The problem: the time for iterating the terms until the exception is thrown is lost... Maybe we could store the iterated terms for reuse (if FilteredTermEnum or a wrapper like BufferedTermEnum had something like the known mark() option from BufferedInputStreams).

          This is just an idea, but has nothing to do with this query, it affects all MultiTermQueries.
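          The auto-switch idea can be sketched independently of Lucene's classes; every name below is hypothetical, and the string results just stand in for the two rewrite outcomes:

```java
public class AutoSwitchSketch {
    // Stand-in for Lucene's TooManyClauses exception.
    static class TooManyClauses extends RuntimeException {}

    // Pretend Boolean rewrite: fails once the clause limit is exceeded.
    static String booleanRewrite(int termCount, int maxClauses) {
        if (termCount > maxClauses) {
            throw new TooManyClauses();
        }
        return "BooleanQuery(" + termCount + " clauses)";
    }

    // The proposed auto-switch: try the scoring Boolean rewrite first, and
    // fall back to a constant-score variant when the clause limit is hit.
    // (As noted above, the terms iterated before the exception are lost
    // unless they are buffered for reuse.)
    static String rewrite(int termCount, int maxClauses) {
        try {
            return booleanRewrite(termCount, maxClauses);
        } catch (TooManyClauses e) {
            return "ConstantScoreQuery";
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite(10, 1024));   // prints BooleanQuery(10 clauses)
        System.out.println(rewrite(5000, 1024)); // prints ConstantScoreQuery
    }
}
```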

          Robert Muir added a comment -

          well here it is just for the record:

          in the query case (fast), time is dominated by AutomatonTermEnum.next(). This is what I expect.
          in the filter case (slower), time is instead dominated by OpenBitSetIterator.next().

          I've seen this with simpler (non-MultiTermQuery) queries before as well.

          For this functionality I still like the constant score rewrite option because there is no risk of hitting the boolean clause limit.

          Robert Muir added a comment -

          my test queries are ones that match like 50-100 out of those 116,000,000... so maybe this helps paint the picture.

          I can profile each one if you are curious?

          Robert Muir added a comment -

          ~ 116,000,000 terms.

          I've seen the same behavior with other Lucene queries on this index, where I do not care about score and thought a filter would be best, but queries still have the edge.

          Mark Miller added a comment -

          How many terms are being enumerated for the test? My guess is that for queries that turn into very large BooleanQueries, it can be much faster to build the filter, but for a smaller BooleanQuery or TermQuery, filter construction dominates?

          Robert Muir added a comment -

          it's ~700ms if I setConstantScoreRewrite(true)
          it's ~150ms otherwise...

          Mark Miller added a comment -

          on my big index it's actually faster without setting the constant score rewrite (maybe creating the huge bitset is expensive?)

          That's surprising, because I have seen people state the opposite on a couple of occasions. Perhaps it has to do with how many terms are being enumerated?

          Robert Muir added a comment -

          ok I refactored this to use FilteredTermEnum/MultiTermQuery as Uwe suggested.

          on my big index it's actually faster without setting the constant score rewrite (maybe creating the huge bitset is expensive?)

          I also changed the term enumeration to be a bit smarter, so it will work well on a large alphabet like CJK now.

          Uwe Schindler added a comment -

          Let's stay with this issue!

          Robert Muir added a comment -

          Uwe, thanks. I'll think on this and on other improvements.
          I'm not really confident in my ability to make the code much cleaner at the end of the day, but it can be more efficient and get some things for free, as you say.
          For now it is working much better than a linear scan, and the improvements won't change the order of growth, but might help a bit.

          Do you think I should try to correct this in this issue or create a separate issue?

          Uwe Schindler added a comment -

          I committed TrieRange revision 765618. You can see the impl here: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/trie/TrieRangeTermEnum.java?view=markup
          Robert Muir added a comment -

          Uwe, I'll look and see how you do it for TrieRange.

          If it can make the code for this simpler, that will be fantastic. Maybe by then I will have also figured out some way to cleanly and non-recursively use min/max character intervals in the state machine to decrease the number of seeks and optimize a little bit.

          Uwe Schindler added a comment -

          It will work, that is what I said. For MultiTermQuery, the enum need not be ordered; the ordering is irrelevant to it, MultiTermQuery only enumerates the terms. TrieRange is an example of that: its terms are not guaranteed to come back in the correct order (they do at the moment because of the internal implementation of splitLongRange(), but I tested it with the inverse order and it still worked). If you want to use the enum for something else, it will fail.
          The filters inside MultiTermQuery and the BooleanQuery do not need to have the terms ordered.

          Robert Muir added a comment -

          Uwe, I agree with you, with one caveat: for this functionality to work, the enum must be ordered correctly according to Term.compareTo().

          Otherwise it will not work correctly...

          Uwe Schindler added a comment -

          I looked into the patch; it looks good. Maybe it would be good to make the new AutomatonRegExQuery a subclass of MultiTermQuery. As you also seek/exchange the TermEnum, the needed FilteredTermEnum may be a little bit complicated, but you could do it the same way I will commit soon for TrieRange (LUCENE-1602).
          The latest changes from LUCENE-1603 make it possible to write a FilteredTermEnum that hands over to differently positioned TermEnums, as you do.
          With MultiTermQuery you get everything for free: constant score, Boolean rewrite, and optionally the Filter (which is not needed here, I think). And you could also override difference() in FilteredTermEnum to rank the hits.
          A note: the FilteredTermEnum created by TrieRange is not guaranteed to be ordered correctly according to Term.compareTo(), but that is not really needed for MultiTermQuery.

          Robert Muir added a comment -

          Mike, the thing it can't do is anything that cannot be determinized. However, I think you only need an NFA for capturing-group-related things:

          http://oreilly.com/catalog/regex/chapter/ch04.html

          One thing is that the brics syntax is a bit different: ^ and $ are implied, and I think some things need to be escaped.
          So I think it can do everything RegexQuery does, but maybe different syntax is required.
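          The "implied ^ and $" point is the main trap for users coming from java.util.regex. The contrast can be shown with java.util.regex alone (hypothetical illustration; brics whole-string matching behaves like matches() here, while an unanchored substring search must be written explicitly as ".*foo.*"):

```java
import java.util.regex.*;

public class AnchoringDemo {
    public static void main(String[] args) {
        // brics-style matching is against the whole string, i.e. like
        // java.util.regex "^ab.*$" with matches():
        System.out.println(Pattern.matches("ab.*", "abcde")); // true: whole string
        System.out.println(Pattern.matches("b.*", "abcde"));  // false: no implicit search
        // An unanchored search needs either find() or explicit ".*b.*":
        System.out.println(Pattern.compile("b.*").matcher("abcde").find()); // true
    }
}
```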

          Michael McCandless added a comment -

          Can this do everything that RegexQuery currently does (i.e., could we deprecate RegexQuery)?

          Robert Muir added a comment -

          Mark, yeah, the enumeration helps a lot; it means far fewer comparisons, plus brics is FAST.

          Inside the AutomatonFilter I describe how it could possibly be done better, but I was afraid I would mess it up.
          It's affected somewhat by the size of the alphabet, so if you were using it against lots of CJK text, it might be worth it to instead use the State/Transition objects in the package. Transitions are described by min and max character intervals, and you can access the intervals in sorted order...

          It's all so nice, but I figure this is a start.
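          The seek-driven enumeration from the issue description ("look at the portion that is OK, generate the next possible string, and seek to that") can be sketched standalone. This is hypothetical code, not from the patch: it stands in for the DFA with java.util.regex, using Matcher.hitEnd() to ask whether a prefix could still grow into a match, and a TreeSet's ceiling() as the "seek" over a sorted term dictionary.

```java
import java.util.*;
import java.util.regex.*;

public class DfaTermSkipper {
    // Collect all terms in 'terms' (sorted) matching 'regex', seeking past
    // whole blocks of dead-prefix terms instead of testing each one.
    public static List<String> matchingTerms(NavigableSet<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        String term = terms.isEmpty() ? null : terms.first();
        while (term != null) {
            if (p.matcher(term).matches()) {
                hits.add(term);
                term = terms.higher(term);
            } else {
                // Longest prefix of 'term' that could still grow into a match.
                int keep = term.length();
                while (keep > 0 && !viablePrefix(p, term.substring(0, keep))) keep--;
                if (keep == term.length()) {
                    // The whole term is a viable prefix, just not accepted as-is.
                    term = terms.higher(term);
                } else {
                    // Every term starting with term[0..keep+1) is dead:
                    // seek past that block by bumping the first dead character.
                    // (This sketch ignores the c == Character.MAX_VALUE edge case.)
                    char c = term.charAt(keep);
                    term = terms.ceiling(term.substring(0, keep) + (char) (c + 1));
                }
            }
        }
        return hits;
    }

    // True if s matches, or could match with more characters appended.
    static boolean viablePrefix(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.matches() || m.hitEnd();
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "apple", "ftp://x", "http://a", "http://b", "zebra"));
        // "apple" is rejected with no viable non-empty prefix, so the
        // enumeration seeks straight to ceiling("b") = "ftp://x".
        System.out.println(matchingTerms(terms, "(http|ftp)://.*"));
        // prints [ftp://x, http://a, http://b]
    }
}
```

          This is also where the sorted-order caveat from the earlier comments bites: ceiling()/higher() only make sense if the term source is ordered consistently with the seek targets.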

          Robert Muir added a comment -

          Oops: I did say in the javadocs that the score is constant (boost only), so when a Wildcard query has no wildcards and rewrites to a TermQuery, wrap it with ConstantScoreQuery(QueryWrapperFilter) to ensure this.

          Mark Miller added a comment -

          Very nice, Robert. This looks like it would make a very nice addition to our regex support.

          I found the benchmarks here quite interesting: http://tusker.org/regex/regex_benchmark.html (though it sounds like your "special" enumeration technique makes this regex implementation even faster for our uses?)

          Robert Muir added a comment -

          Here is an updated patch with AutomatonWildCardQuery.

          This implements the standard Lucene wildcard query with AutomatonFilter.

          It accelerates quite a few wildcard situations, such as ??(a|b)?cd*ef.
          Sorry, it provides no help for a leading *, but it definitely helps for a leading ?.

          All wildcard tests pass.
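          The core of such a wildcard query is reducing wildcard syntax to a regular expression that an automaton can be built from. A minimal sketch of that reduction (hypothetical illustration, not the patch's code; it handles only the ? and * operators and validates with java.util.regex rather than brics):

```java
import java.util.regex.*;

public class WildcardToRegex {
    // Translate Lucene-style wildcards: '?' -> any one char, '*' -> any run.
    // Everything else is quoted so regex metacharacters stay literal.
    public static String toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') sb.append('.');
            else if (c == '*') sb.append(".*");
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String re = toRegex("??cd*ef");
        System.out.println(Pattern.matches(re, "abcdXXef")); // true
        System.out.println(Pattern.matches(re, "abcdef"));   // true: '*' may match nothing
        System.out.println(Pattern.matches(re, "acdef"));    // false: '?' needs two chars
    }
}
```

          Note that a leading ? still pins the automaton to "exactly one character, then cd...", which is why it can be accelerated, while a leading * accepts any prefix and forces a full term scan.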

          Robert Muir added a comment -

          patch


            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0
              Watchers:
              3
