Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7465

Add a PatternTokenizer that uses Lucene's RegExp implementation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I think there are some nice benefits to a version of PatternTokenizer that uses Lucene's RegExp impl instead of the JDK's:

      • Lucene's RegExp is compiled to a DFA up front, so if a "too hard" RegExp is attempted the user discovers it up front instead of later on when a "lucky" document arrives
      • It processes the incoming characters as a stream, only pulling 128 characters at a time, vs the existing PatternTokenizer which currently reads the entire string up front (this has caused heap problems in the past)
      • It should be fast.

      I named it SimplePatternTokenizer, and it still needs a factory and improved tests, but I think it's otherwise close.

      It currently does not take a group parameter because Lucene's RegExps don't yet implement sub group capture. I think we could add that at some point, but it's a bit tricky.

      This doesn't even have group=-1 support (like String.split) ... I think if we did that we should maybe name it differently (SimplePatternSplitTokenizer?).

      1. LUCENE-7465.patch
        55 kB
        Michael McCandless
      2. LUCENE-7465.patch
        33 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Patch.

        I added the SimplePatternTokenizerFactory as well.

        SimplePatternTokenizer takes either a String (parses as a regexp
        and compiles it) or a DFA (expert user who pre-built their own
        automaton).

        I folded in a nice idea from Robert Muir to optimize the ascii code
        points even when using CharacterRunAutomaton.

        It's quite fast, ~46% faster than PatternTokenizer when tokenizing
        1 MB chunks from the English Wikipedia export, using a simplistic
        whitespace regexp [^ \t\r\n]+.

        And it's nice that it doesn't read the entire input into heap!

        Show
        mikemccand Michael McCandless added a comment - Patch. I added the SimplePatternTokenizerFactory as well. SimplePatternTokenizer takes either a String (parses as a regexp and compiles it) or a DFA (expert user who pre-built their own automaton). I folded in a nice idea from Robert Muir to optimize the ascii code points even when using CharacterRunAutomaton . It's quite fast, ~46% faster than PatternTokenizer when tokenizing 1 MB chunks from the English Wikipedia export, using a simplistic whitespace regexp [^ \t\r\n]+ . And it's nice that it doesn't read the entire input into heap!
        Hide
        mikemccand Michael McCandless added a comment -

        Another iteration, adding SimplePatternSplitTokenizer. It's surprisingly different from the non-spit case, and sort of complex But it does pass its tests. I haven't compared performance to PatternTokenizer with group -1 yet. I'll see if I can simplify it, but I think this is otherwise close.

        Show
        mikemccand Michael McCandless added a comment - Another iteration, adding SimplePatternSplitTokenizer . It's surprisingly different from the non-spit case, and sort of complex But it does pass its tests. I haven't compared performance to PatternTokenizer with group -1 yet. I'll see if I can simplify it, but I think this is otherwise close.
        Hide
        dweiss Dawid Weiss added a comment -

        Interesting that it's faster than PatternTokenizer! I haven't looked at the patch, Mike, I did some experiments recently with regexp benchmarking (for our internal needs) and fairly large regular expression patterns (over an even larger inputs). The native java pattern implementation always won by a large (and I mean: super large) margin over anything else I tried. I tried brics, re2 (java port), re2 (native implementation), Apache ORO (out of curiosity only, it didn't pass correctness tests for me). Brics wasn't too bad, but the gain from early detection of "too hard" DFA expressions was overshadowed by DFA expansion (very large automata in our case), so unless you don't have control over the patterns (in which case adversarial cases can be executed), it didn't make sense for me to switch.

        Also, the fact that the java implementation was fast was quite surprising to me as we had a large number of alternatives in regular expressions and I thought these would nicely yield to automaton optimizations (pullup of prefix matching, etc.). In the end, it didn't seem to matter. So perhaps the performance is a factor of how complex the regular expressions are (and how they're benchmarked)? Don't know.

        Show
        dweiss Dawid Weiss added a comment - Interesting that it's faster than PatternTokenizer! I haven't looked at the patch, Mike, I did some experiments recently with regexp benchmarking (for our internal needs) and fairly large regular expression patterns (over an even larger inputs). The native java pattern implementation always won by a large (and I mean: super large) margin over anything else I tried. I tried brics, re2 (java port), re2 (native implementation), Apache ORO (out of curiosity only, it didn't pass correctness tests for me). Brics wasn't too bad, but the gain from early detection of "too hard" DFA expressions was overshadowed by DFA expansion (very large automata in our case), so unless you don't have control over the patterns (in which case adversarial cases can be executed), it didn't make sense for me to switch. Also, the fact that the java implementation was fast was quite surprising to me as we had a large number of alternatives in regular expressions and I thought these would nicely yield to automaton optimizations (pullup of prefix matching, etc.). In the end, it didn't seem to matter. So perhaps the performance is a factor of how complex the regular expressions are (and how they're benchmarked)? Don't know.
        Hide
        dsmiley David Smiley added a comment -

        Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple". Then I wonder if we might detect circumstances in which this new implementation is preferable? The motivation for one factory is similar to the WhitespaceTokenizer "rule" param. People know they have a regexp and want to tokenize on it... they could easily overlook a different factory than the one that is already named well and may appear in blogs/docs/examples.

        Show
        dsmiley David Smiley added a comment - Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple" . Then I wonder if we might detect circumstances in which this new implementation is preferable? The motivation for one factory is similar to the WhitespaceTokenizer "rule" param. People know they have a regexp and want to tokenize on it... they could easily overlook a different factory than the one that is already named well and may appear in blogs/docs/examples.
        Hide
        mikemccand Michael McCandless added a comment -

        Dawid Weiss is this a benchmark I could try to run? My regexp was admittedly trivial so it would be nice to have a beefier real-world regexp to play with

        The bench is also trivial (I pushed it to luceneutil).

        When you tested dk.brics, did you call the RunAutomaton.setAlphabet? This should be a biggish speedup, especially if your regexp has many unique character start/end ranges. In Lucene's fork of dk.briks we automatically do that in the utf8 case, and with this patch, also for the first 256 unicode characters in the full character case.

        Show
        mikemccand Michael McCandless added a comment - Dawid Weiss is this a benchmark I could try to run? My regexp was admittedly trivial so it would be nice to have a beefier real-world regexp to play with The bench is also trivial (I pushed it to luceneutil). When you tested dk.brics, did you call the RunAutomaton.setAlphabet ? This should be a biggish speedup, especially if your regexp has many unique character start/end ranges. In Lucene's fork of dk.briks we automatically do that in the utf8 case, and with this patch, also for the first 256 unicode characters in the full character case.
        Hide
        mikemccand Michael McCandless added a comment -

        Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple".

        I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat?

        Then I wonder if we might detect circumstances in which this new implementation is preferable?

        I think such auto-detection (looking at the user's pattern and picking the engine) is a dangerous. Maybe a user is debugging a tricky regexp, and adding one new character causes us to pick a different engine or something. I think for now it should be a conscious choice?

        Show
        mikemccand Michael McCandless added a comment - Instead of adding another factory, what about adding an implementation hint parameter to PatternTokenizerFactory? e.g. method="lucene" or method="simple". I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat? Then I wonder if we might detect circumstances in which this new implementation is preferable? I think such auto-detection (looking at the user's pattern and picking the engine) is a dangerous. Maybe a user is debugging a tricky regexp, and adding one new character causes us to pick a different engine or something. I think for now it should be a conscious choice?
        Hide
        dweiss Dawid Weiss added a comment -

        I'll try to repeat the experiment with Lucene's regexp when I have a spare moment. The benchmarks (or rather: data) cannot be shared, unfortunately, but it involved regexps with hundreds of alternatives and globs. Definitely not something that people can edit by hand.

        Show
        dweiss Dawid Weiss added a comment - I'll try to repeat the experiment with Lucene's regexp when I have a spare moment. The benchmarks (or rather: data) cannot be shared, unfortunately, but it involved regexps with hundreds of alternatives and globs. Definitely not something that people can edit by hand.
        Hide
        mikemccand Michael McCandless added a comment -

        Maybe you could share just the regexp

        But, if you do repeat the test w/ Lucene, then try to use this patch if possible (just the XXXRunAutomaton changes), because it optimizes for code points < 256. Or, if your data is all non-ascii, then don't bother using this patch

        Show
        mikemccand Michael McCandless added a comment - Maybe you could share just the regexp But, if you do repeat the test w/ Lucene, then try to use this patch if possible (just the XXXRunAutomaton changes), because it optimizes for code points < 256. Or, if your data is all non-ascii, then don't bother using this patch
        Hide
        dsmiley David Smiley added a comment -

        I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat?

        Not a required param; default to current java regexp impl; no back-compat worry. It's the most flexible and so I think makes the best default.

        I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous.

        Ok.

        Show
        dsmiley David Smiley added a comment - I agree this would be nice, but my worry about taking that approach is which one we default to? Maybe if we make it a required param? But then how to implement back compat? Not a required param; default to current java regexp impl; no back-compat worry. It's the most flexible and so I think makes the best default. I think such auto-detection (looking at the user's pattern and picking the engine) is dangerous. Ok.
        Hide
        mikemccand Michael McCandless added a comment -

        default to current java regexp impl;

        But I think that would mean this new impl would very rarely be used.

        I think it's better to give it a separate factory so it has more visibility? If it really does work better for users over time, word will spread, new blog posts/docs written, etc.

        Show
        mikemccand Michael McCandless added a comment - default to current java regexp impl; But I think that would mean this new impl would very rarely be used. I think it's better to give it a separate factory so it has more visibility? If it really does work better for users over time, word will spread, new blog posts/docs written, etc.
        Hide
        dweiss Dawid Weiss added a comment -

        These regexps are generated from the data, so not so easy And the data (and the regexps) can contain Unicode characters as well. I'll go back to this, time permitting. I'm not saying the patch is wrong, just that j.u.Pattern was pretty darn fast, even for large-scale patterns (and inputs). I was in particular surprised at re2 (C implementation) performance being way lower than Java's. Of course there were no adversarial cases in the input.

        Show
        dweiss Dawid Weiss added a comment - These regexps are generated from the data, so not so easy And the data (and the regexps) can contain Unicode characters as well. I'll go back to this, time permitting. I'm not saying the patch is wrong, just that j.u.Pattern was pretty darn fast, even for large-scale patterns (and inputs). I was in particular surprised at re2 (C implementation) performance being way lower than Java's. Of course there were no adversarial cases in the input.
        Hide
        jpountz Adrien Grand added a comment -

        I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.

        Show
        jpountz Adrien Grand added a comment - I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.
        Hide
        dweiss Dawid Weiss added a comment -

        Hi Mike. Sorry it took me so long. So, check out this example snippet:

          public static void main(String[] args) {
            String [] clauses = "(.*mervi)|(.*hectic)|(petrographic)|(terracing.*)|(3\\.65.*)|(.*mea.*)|(.*n0)|(researchbas)|(chamfer.*)|(.*danaher)|(.*immediacy)|(.*selec)|(.*transi)|(.*photoreaction)|(ceo2)|(asif)|(.*koo.*)|(lasso)|(allis)|(.*paleodata.*)|(needs.*)|(auser)|(micropterus.*)|(.*sdw)|(.*blp.*)|(cent)|(hybridoma)|(tai.*)|(ransac)|(.*gfptag)|(.*falt.*)|(tubular)|(.*closet.*)|(.*halted.*)|(plish.*)|(.*aauw)|(satisf)|(.*kolodn)|(.*glycidyl.*)|(phytodetritu.*)|(.*2r)|(.*remodeler)|(astronomi)|(.*maienschein)|(universityof)|(event\\(s)|(exacerbation)|(leidi.*)|(stemmer.*)|(.*arrow)|(.*domestic)|(.*maq.*)|(pluggable.*)|(scheiner.*)|(interpenetrate)|(.*diving)|(superscript.*)|(.*cherry.*)|(saddlepoint)|(pyrolit.*)|(prosser)|(nyberg)|(iceberg.*)|(.*hammer.*)|(india.*)|(fsa)|(.*x\\(u.*)|(klima)|(good.*)|(.*provid)|(.*streaked)|(.*oppenheimer.*)|(loyalty.*)|(.*caspi.*)|(.*,99)|(.*unaccompanied)|(subharmon)|(.*hillis.*)|(ferment)|(olli)|(.*storybook)|(1358)|(.*savi.*)|(contagion)|(.*freeness)|(.*500m)|(brudvig.*)|(.*genemark)|(.*jahren.*)|(aguirr)|(12345.*)|(.*prolic)|(seafood.*)|(.*remedy)|(.*mildred.*)|(.*bering.*)|(monolithically.*)|(disequilibrium)".split("\\|");
            for (int i = 1; i < clauses.length; i++) {
              String re = Arrays.stream(clauses) 
                  .limit(i)
                  .collect(Collectors.joining("|"));
              RegExp regExp = new RegExp(re);
              Automaton automaton = regExp.toAutomaton(10000);
              System.out.println("Clauses: " + i + ", states=" + automaton.getNumStates());
            }
          }
        

        As you can see it's essentially a "prefix/suffix/exact" match. Unfortunately this is a very bad example to determinize, so I can't even sensibly benchmark it against other implementations (there can be hundreds or thousands of such clauses). But even this short snippet shows the severe penalty full determinization incurrs – try to run it and you'll see.

        Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt.

        Show
        dweiss Dawid Weiss added a comment - Hi Mike. Sorry it took me so long. So, check out this example snippet: public static void main( String [] args) { String [] clauses = "(.*mervi)|(.*hectic)|(petrographic)|(terracing.*)|(3\\.65.*)|(.*mea.*)|(.*n0)|(researchbas)|(chamfer.*)|(.*danaher)|(.*immediacy)|(.*selec)|(.*transi)|(.*photoreaction)|(ceo2)|(asif)|(.*koo.*)|(lasso)|(allis)|(.*paleodata.*)|(needs.*)|(auser)|(micropterus.*)|(.*sdw)|(.*blp.*)|(cent)|(hybridoma)|(tai.*)|(ransac)|(.*gfptag)|(.*falt.*)|(tubular)|(.*closet.*)|(.*halted.*)|(plish.*)|(.*aauw)|(satisf)|(.*kolodn)|(.*glycidyl.*)|(phytodetritu.*)|(.*2r)|(.*remodeler)|(astronomi)|(.*maienschein)|(universityof)|(event\\(s)|(exacerbation)|(leidi.*)|(stemmer.*)|(.*arrow)|(.*domestic)|(.*maq.*)|(pluggable.*)|(scheiner.*)|(interpenetrate)|(.*diving)|(superscript.*)|(.*cherry.*)|(saddlepoint)|(pyrolit.*)|(prosser)|(nyberg)|(iceberg.*)|(.*hammer.*)|(india.*)|(fsa)|(.*x\\(u.*)|(klima)|(good.*)|(.*provid)|(.*streaked)|(.*oppenheimer.*)|(loyalty.*)|(.*caspi.*)|(.*,99)|(.*unaccompanied)|(subharmon)|(.*hillis.*)|(ferment)|(olli)|(.*storybook)|(1358)|(.*savi.*)|(contagion)|(.*freeness)|(.*500m)|(brudvig.*)|(.*genemark)|(.*jahren.*)|(aguirr)|(12345.*)|(.*prolic)|(seafood.*)|(.*remedy)|(.*mildred.*)|(.*bering.*)|(monolithically.*)|(disequilibrium)" .split( "\\|" ); for ( int i = 1; i < clauses.length; i++) { String re = Arrays.stream(clauses) .limit(i) .collect(Collectors.joining( "|" )); RegExp regExp = new RegExp(re); Automaton automaton = regExp.toAutomaton(10000); System .out.println( "Clauses: " + i + ", states=" + automaton.getNumStates()); } } As you can see it's essentially a "prefix/suffix/exact" match. Unfortunately this is a very bad example to determinize, so I can't even sensibly benchmark it against other implementations (there can be hundreds or thousands of such clauses). But even this short snippet shows the severe penalty full determinization incurrs – try to run it and you'll see. Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque , it certainly won't hurt.
        Hide
        dweiss Dawid Weiss added a comment -

        On a happier note, if it's just a union of fixed-strings (a fsa, effectively) you're matching against then it's much much faster with Lucene (and Brics), of course (times in ms.):

        JavaRegExpMatcher samples: 100000 time: 11026
        JavaRegExpMatcher samples: 100000 time: 11046
        JavaRegExpMatcher samples: 100000 time: 11036
        LuceneRegExpMatcher samples: 100000 time: 19
        LuceneRegExpMatcher samples: 100000 time: 19
        LuceneRegExpMatcher samples: 100000 time: 18
        
        Show
        dweiss Dawid Weiss added a comment - On a happier note, if it's just a union of fixed-strings (a fsa, effectively) you're matching against then it's much much faster with Lucene (and Brics), of course (times in ms.): JavaRegExpMatcher samples: 100000 time: 11026 JavaRegExpMatcher samples: 100000 time: 11046 JavaRegExpMatcher samples: 100000 time: 11036 LuceneRegExpMatcher samples: 100000 time: 19 LuceneRegExpMatcher samples: 100000 time: 19 LuceneRegExpMatcher samples: 100000 time: 18
        Hide
        mikemccand Michael McCandless added a comment -

        Thank you for the example Dawid Weiss. Indeed that's a hard regexp to determinize. It's interesting because the determinization requires many states, yet it minimizes to an apparently contained number of states (though many transitions).

        E.g. at 30 clauses, determized form produced 7652 states and 136898 transitions, but after minimize that drops to 150 states and 2960 transitions. I tried to run dot on this FSA but it struggles

        Net/net the DFA approach is not usable in some cases (like this one); such users must use the JDK implementation. Maybe we should explore an re2j version too.

        Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt.

        Excellent, I'll fold that in!

        Show
        mikemccand Michael McCandless added a comment - Thank you for the example Dawid Weiss . Indeed that's a hard regexp to determinize. It's interesting because the determinization requires many states, yet it minimizes to an apparently contained number of states (though many transitions). E.g. at 30 clauses, determized form produced 7652 states and 136898 transitions, but after minimize that drops to 150 states and 2960 transitions. I tried to run dot on this FSA but it struggles Net/net the DFA approach is not usable in some cases (like this one); such users must use the JDK implementation. Maybe we should explore an re2j version too. Btw. if you're looking into this again, piggyback a change to Operations.determinize and replace LinkedList with an ArrayDeque, it certainly won't hurt. Excellent, I'll fold that in!
        Hide
        dweiss Dawid Weiss added a comment -

        Maybe we should explore an re2j version too.

        I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton... Sure, it wouldn't be able to protect against an explosion of states at compile-time, but it'd still be possible to protect against it at runtime (if too many states need to be tracked within the automaton, we could throw a matching exception). Note that a non-deterministic automaton for the above regular expression is actually pretty simple!

        Show
        dweiss Dawid Weiss added a comment - Maybe we should explore an re2j version too. I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton ... Sure, it wouldn't be able to protect against an explosion of states at compile-time, but it'd still be possible to protect against it at runtime (if too many states need to be tracked within the automaton, we could throw a matching exception). Note that a non-deterministic automaton for the above regular expression is actually pretty simple!
        Hide
        mikemccand Michael McCandless added a comment -

        Whoa, this issue almost dropped past the event horizon on my TODO list! I'll revive the patch and push soon ...

        I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton

        I think this is interesting, but let's explore it on a future issue?

        Show
        mikemccand Michael McCandless added a comment - Whoa, this issue almost dropped past the event horizon on my TODO list! I'll revive the patch and push soon ... I think it'd be more interesting to actually write a (simple!) matcher on top of a non-determinized Automaton I think this is interesting, but let's explore it on a future issue?
        Hide
        dweiss Dawid Weiss added a comment -

        I think this is interesting, but let's explore it on a future issue?

        Absolutely!

        Show
        dweiss Dawid Weiss added a comment - I think this is interesting, but let's explore it on a future issue? Absolutely!
        Hide
        dsmiley David Smiley added a comment -

        (Adrien) I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox.

        I think the factory isn't going to stand in the way of either tokenizer evolving. A problem with separate factories is that the name PatternTokenizerFactory is already an excellent name, nor does it have hints as to how it works. In general I don't like polluting the namespace with different implementations of effectively the same thing; the first impl to show up grabs the best name. The Factory provides an excellent opportunity to bridge these multiple implementations.

        Yet alas, my arguments aren't swaying anyone so go ahead.

        Show
        dsmiley David Smiley added a comment - (Adrien) I like the separate factory idea better, it makes it easier to evolve those two impls separately, eg. in the case that we decide to deprecate PatternTokenizer or to move it to sandbox. I think the factory isn't going to stand in the way of either tokenizer evolving. A problem with separate factories is that the name PatternTokenizerFactory is already an excellent name, nor does it have hints as to how it works. In general I don't like polluting the namespace with different implementations of effectively the same thing; the first impl to show up grabs the best name. The Factory provides an excellent opportunity to bridge these multiple implementations. Yet alas, my arguments aren't swaying anyone so go ahead.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 93fa72f77bd024aa09eef043c65c64a6524613dc in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=93fa72f ]

        LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation

        Show
        jira-bot ASF subversion and git services added a comment - Commit 93fa72f77bd024aa09eef043c65c64a6524613dc in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=93fa72f ] LUCENE-7465 : add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit c24e03e6bf4d09e6f31eee8192bb6c0c4b2b6d27 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c24e03e ]

        LUCENE-7465: add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation

        Show
        jira-bot ASF subversion and git services added a comment - Commit c24e03e6bf4d09e6f31eee8192bb6c0c4b2b6d27 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c24e03e ] LUCENE-7465 : add SimplePatternTokenizer and SimpleSplitPatternTokenizer, for tokenization using Lucene's regexp/automaton implementation
        Hide
        steve_rowe Steve Rowe added a comment -

        My Jenkins found a reproducing seed on master for a TestRandomChains failure that implicates the new tokenizer:

           [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
           [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf'
           [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf'
           [junit4]   2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf'
           [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
           [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-17,5,TGRP-TestRandomChains]
           [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
           [junit4]   2> 	at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
           [junit4]   2> 	at org.junit.Assert.fail(Assert.java:93)
           [junit4]   2> 	at org.junit.Assert.failNotEquals(Assert.java:647)
           [junit4]   2> 	at org.junit.Assert.assertEquals(Assert.java:128)
           [junit4]   2> 	at org.junit.Assert.assertEquals(Assert.java:472)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
           [junit4]   2> 
           [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
           [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-18,5,TGRP-TestRandomChains]
           [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
           [junit4]   2> 	at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
           [junit4]   2> 	at org.junit.Assert.fail(Assert.java:93)
           [junit4]   2> 	at org.junit.Assert.failNotEquals(Assert.java:647)
           [junit4]   2> 	at org.junit.Assert.assertEquals(Assert.java:128)
           [junit4]   2> 	at org.junit.Assert.assertEquals(Assert.java:472)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
           [junit4]   2> 
           [junit4]   2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
           [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-19,5,TGRP-TestRandomChains]
           [junit4]   2> java.lang.AssertionError: finalOffset expected:<79> but was:<65>
           [junit4]   2> 	at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
           [junit4]   2> 	at org.junit.Assert.fail(Assert.java:93)
           [junit4]   2> 	at org.junit.Assert.failNotEquals(Assert.java:647)
           [junit4]   2> 	at org.junit.Assert.assertEquals(Assert.java:128)
           [junit4]   2> 	at org.junit.Assert.assertEquals(Assert.java:472)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
           [junit4]   2> 
           [junit4]   2> Exception from random analyzer: 
           [junit4]   2> charfilters=
           [junit4]   2>   org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@754b8c24)
           [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(org.apache.lucene.analysis.MockCharFilter@6f0a7841)
           [junit4]   2> tokenizer=
           [junit4]   2>   org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer(org.apache.lucene.util.automaton.Automaton@aa8d0c)
           [junit4]   2> filters=
           [junit4]   2> offsetsAreCorrect=true
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=3ABEF2F287EE4968 -Dtests.slow=true -Dtests.locale=es-US -Dtests.timezone=America/Montreal -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] ERROR   1.17s J1 | TestRandomChains.testRandomChainsWithLargeStrings <<<
           [junit4]    > Throwable #1: java.lang.RuntimeException: some thread(s) failed
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:562)
           [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=70, name=Thread-17, state=RUNNABLE, group=TGRP-TestRandomChains]
           [junit4]    > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65>
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable #3: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=72, name=Thread-19, state=RUNNABLE, group=TGRP-TestRandomChains]
           [junit4]    > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65>
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable #4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=71, name=Thread-18, state=RUNNABLE, group=TGRP-TestRandomChains]
           [junit4]    > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65>
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)
           [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=false): {}, locale=es-US, timezone=America/Montreal
           [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=358995304,total=524288000
        
        Show
        steve_rowe Steve Rowe added a comment - My Jenkins found a reproducing seed on master for a TestRandomChains failure that implicates the new tokenizer: [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains [junit4] 2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf' [junit4] 2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf' [junit4] 2> TEST FAIL: useCharFilter=false text='puzoh \u6a8b\u59e2\u96aa\u85f0\u614a\u9010\u7782\u5547 \uef27\uda09\uddd2\u9b9c\u056e\u33f0 W\udb24\udce6> \u2d12\u2d23\u2d05\u2d1c\u2d23 *\ud9f0\udc74\uea94\ub9c6 pev trjrbvcwb tzzntfd y|)]){1 </p> gmabf' [junit4] 2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-17,5,TGRP-TestRandomChains] [junit4] 2> java.lang.AssertionError: finalOffset expected:<79> but was:<65> [junit4] 2> at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0) [junit4] 2> at org.junit.Assert.fail(Assert.java:93) [junit4] 2> at org.junit.Assert.failNotEquals(Assert.java:647) [junit4] 2> at org.junit.Assert.assertEquals(Assert.java:128) [junit4] 2> at org.junit.Assert.assertEquals(Assert.java:472) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510) [junit4] 2> [junit4] 2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-18,5,TGRP-TestRandomChains] [junit4] 2> java.lang.AssertionError: finalOffset expected:<79> but was:<65> [junit4] 2> at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0) [junit4] 2> at org.junit.Assert.fail(Assert.java:93) [junit4] 2> at org.junit.Assert.failNotEquals(Assert.java:647) [junit4] 2> at org.junit.Assert.assertEquals(Assert.java:128) [junit4] 2> at org.junit.Assert.assertEquals(Assert.java:472) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510) [junit4] 2> [junit4] 2> feb 14, 2017 2:13:13 P.M. com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException [junit4] 2> WARNING: Uncaught exception in thread: Thread[Thread-19,5,TGRP-TestRandomChains] [junit4] 2> java.lang.AssertionError: finalOffset expected:<79> but was:<65> [junit4] 2> at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0) [junit4] 2> at org.junit.Assert.fail(Assert.java:93) [junit4] 2> at org.junit.Assert.failNotEquals(Assert.java:647) [junit4] 2> at org.junit.Assert.assertEquals(Assert.java:128) [junit4] 2> at org.junit.Assert.assertEquals(Assert.java:472) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66) [junit4] 2> at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510) [junit4] 2> [junit4] 2> Exception from random analyzer: [junit4] 2> charfilters= [junit4] 2> org.apache.lucene.analysis.MockCharFilter(java.io.StringReader@754b8c24) [junit4] 2> org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(org.apache.lucene.analysis.MockCharFilter@6f0a7841) [junit4] 2> tokenizer= [junit4] 2> org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizer(org.apache.lucene.util.automaton.Automaton@aa8d0c) [junit4] 2> filters= [junit4] 2> offsetsAreCorrect=true [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=3ABEF2F287EE4968 -Dtests.slow=true -Dtests.locale=es-US -Dtests.timezone=America/Montreal -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 1.17s J1 | TestRandomChains.testRandomChainsWithLargeStrings <<< [junit4] > Throwable #1: java.lang.RuntimeException: some thread(s) failed [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:562) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880) [junit4] > at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=70, name=Thread-17, state=RUNNABLE, group=TGRP-TestRandomChains] [junit4] > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65> [junit4] > at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable #3: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=72, name=Thread-19, state=RUNNABLE, group=TGRP-TestRandomChains] [junit4] > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65> [junit4] > at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510)Throwable #4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=71, name=Thread-18, state=RUNNABLE, group=TGRP-TestRandomChains] [junit4] > Caused by: java.lang.AssertionError: finalOffset expected:<79> but was:<65> [junit4] > at __randomizedtesting.SeedInfo.seed([3ABEF2F287EE4968]:0) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:66) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:510) [junit4] 2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=false): {}, locale=es-US, timezone=America/Montreal [junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=358995304,total=524288000
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Steve Rowe, I'll dig.

        Show
        mikemccand Michael McCandless added a comment - Thanks Steve Rowe , I'll dig.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 5ca3ca205234694224341331980216a07c8e518b in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ca3ca2 ]

        LUCENE-7465: use the right random instance otheerwise we hit creepy test failures

        Show
        jira-bot ASF subversion and git services added a comment - Commit 5ca3ca205234694224341331980216a07c8e518b in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ca3ca2 ] LUCENE-7465 : use the right random instance otheerwise we hit creepy test failures
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 74f208d716ef25aaead97e674c5296f2be54eb76 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74f208d ]

        LUCENE-7465: use the right random instance otheerwise we hit creepy test failures

        Show
        jira-bot ASF subversion and git services added a comment - Commit 74f208d716ef25aaead97e674c5296f2be54eb76 in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=74f208d ] LUCENE-7465 : use the right random instance otheerwise we hit creepy test failures
        Hide
        mikemccand Michael McCandless added a comment -

        OK I pushed a fix... sneaky wrong random instance usage.

        Show
        mikemccand Michael McCandless added a comment - OK I pushed a fix... sneaky wrong random instance usage.
        Hide
        steve_rowe Steve Rowe added a comment -

        Another reproducing TestRandomChains master seed, from https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/19011/:

           [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
           [junit4]   2> TEST FAIL: useCharFilter=false text='Sy \ud98b\udc04\uff52\u0384\u942fP\u040a\u0004\u0455 |uh)a)mrB- '
           [junit4]   2> Exception from random analyzer: 
           [junit4]   2> charfilters=
           [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@29890127, [<ACRONYM_DEP>])
           [junit4]   2> tokenizer=
           [junit4]   2>   org.apache.lucene.analysis.pattern.SimplePatternTokenizer(org.apache.lucene.util.automaton.Automaton@9bd88db)
           [junit4]   2> filters=
           [junit4]   2> offsetsAreCorrect=true
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=821B9B2715E2264F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=el-GR -Dtests.timezone=America/Lima -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
           [junit4] FAILURE 0.15s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: finalOffset expected:<24> but was:<23>
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([821B9B2715E2264F:E84024364CAC06BC]:0)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
           [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
           [junit4]   2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=true): {}, locale=el-GR, timezone=America/Lima
           [junit4]   2> NOTE: Linux 4.4.0-53-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=12,threads=1,free=457314736,total=508952576
        
        Show
        steve_rowe Steve Rowe added a comment - Another reproducing TestRandomChains master seed, from https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/19011/ : [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains [junit4] 2> TEST FAIL: useCharFilter=false text='Sy \ud98b\udc04\uff52\u0384\u942fP\u040a\u0004\u0455 |uh)a)mrB- ' [junit4] 2> Exception from random analyzer: [junit4] 2> charfilters= [junit4] 2> org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@29890127, [<ACRONYM_DEP>]) [junit4] 2> tokenizer= [junit4] 2> org.apache.lucene.analysis.pattern.SimplePatternTokenizer(org.apache.lucene.util.automaton.Automaton@9bd88db) [junit4] 2> filters= [junit4] 2> offsetsAreCorrect=true [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=821B9B2715E2264F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=el-GR -Dtests.timezone=America/Lima -Dtests.asserts=true -Dtests.file.encoding=US-ASCII [junit4] FAILURE 0.15s J2 | TestRandomChains.testRandomChainsWithLargeStrings <<< [junit4] > Throwable #1: java.lang.AssertionError: finalOffset expected:<24> but was:<23> [junit4] > at __randomizedtesting.SeedInfo.seed([821B9B2715E2264F:E84024364CAC06BC]:0) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:293) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:308) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:312) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:843) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642) [junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:880) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: test params are: codec=CheapBastard, sim=RandomSimilarity(queryNorm=true): {}, locale=el-GR, timezone=America/Lima [junit4] 2> NOTE: Linux 4.4.0-53-generic amd64/Oracle Corporation 1.8.0_121 (64-bit)/cpus=12,threads=1,free=457314736,total=508952576
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Steve Rowe; I'll have a look.

        Show
        mikemccand Michael McCandless added a comment - Thanks Steve Rowe ; I'll have a look.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 2d03aa21a2b674d36e201f6309e646f37771b73b in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d03aa2 ]

        LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input

        Show
        jira-bot ASF subversion and git services added a comment - Commit 2d03aa21a2b674d36e201f6309e646f37771b73b in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2d03aa2 ] LUCENE-7465 : fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit c3028b32207b8837cdaf29918edd4e0cdc9621ad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c3028b3 ]

        LUCENE-7465: fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input

        Show
        jira-bot ASF subversion and git services added a comment - Commit c3028b32207b8837cdaf29918edd4e0cdc9621ad in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c3028b3 ] LUCENE-7465 : fix corner case in SimplePattern/SplitTokenizer when lookahead hits end of input
        Hide
        mikemccand Michael McCandless added a comment -

        That test failure was actually a real bug in both SimplePatternTokenizer and SimpleSplitPatternTokenizer! Yay for TestRandomChains I pushed a fix.

        Show
        mikemccand Michael McCandless added a comment - That test failure was actually a real bug in both SimplePatternTokenizer and SimpleSplitPatternTokenizer ! Yay for TestRandomChains I pushed a fix.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development