Lucene - Core

LUCENE-5897: performance bug ("adversary") in StandardTokenizer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9.1, 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      There seem to be some conditions (I don't know how rare or what conditions) that cause StandardTokenizer to essentially hang on input. I haven't looked hard yet, but as it's essentially a DFA I think something weird might be going on.

      An easy way to reproduce is with 1MB of underscores; it will just hang forever.

        public void testWorthyAdversary() throws Exception {
          // 1MB of underscores: no tokens should be produced, but tokenization
          // should still terminate in a reasonable time.
          char buffer[] = new char[1024 * 1024];
          Arrays.fill(buffer, '_');
          int tokenCount = 0;
          Tokenizer ts = new StandardTokenizer();
          ts.setReader(new StringReader(new String(buffer)));
          ts.reset();
          while (ts.incrementToken()) {
            tokenCount++;
          }
          ts.end();
          ts.close();
          assertEquals(0, tokenCount);
        }
      

        Activity

        Robert Muir added a comment -

        it seems stuck here:

        TRACE 301319:
                org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:756)
                org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:150)
                org.apache.lucene.analysis.core.TestStandardAnalyzer.testWorthyAdversary(TestStandardAnalyzer.java:286)
        

        This is in generated code, so I don't yet know if it's something about our grammar or something in JFlex itself.

        Robert Muir added a comment -

        I spent some time trying to debug it, didn't get very far.

        I know at least that you can substitute any Extend_Num_Let character for the underscore and it hangs. I don't yet know whether other word-break categories would have a similar issue, maybe even in other contexts.
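
        For example, a hypothetical variant of the test from the description that swaps in U+203F (UNDERTIE), another ExtendNumLet character, for the underscore (the test name and scaffolding here are assumptions, not part of the patch):

          // Hypothetical variant of testWorthyAdversary(): U+203F (UNDERTIE) is
          // another ExtendNumLet character, so per the observation above it
          // should trigger the same hang.
          public void testUndertieAdversary() throws Exception {
            char buffer[] = new char[1024 * 1024];
            Arrays.fill(buffer, '\u203F');
            Tokenizer ts = new StandardTokenizer();
            ts.setReader(new StringReader(new String(buffer)));
            ts.reset();
            while (ts.incrementToken()) {
              // expect no tokens; the bug is that this loop effectively never returns
            }
            ts.end();
            ts.close();
          }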

        Steve Rowe added a comment -

        I'm looking into it, I think it's a bug in JFlex. zzRefill(), which pulls data into and if necessary expands the buffer over which tokenization occurs, is being called repeatedly even though EOF has been reached.

        I'm going to see if this reproduces in Lucene 4.9 - I suspect I introduced the bug in JFlex 1.6. If so, the thing to do is likely revert the JFlex 1.5->1.6 changes (LUCENE-5770), since that hasn't been released yet.

        Steve Rowe added a comment -

        I'm going to see if this reproduces in Lucene 4.9 - I suspect I introduced the bug in JFlex 1.6. If so, the thing to do is likely revert the JFlex 1.5->1.6 changes (LUCENE-5770), since that hasn't been released yet.

        Unfortunately, the hanging behavior occurs on 4.9 too, so reverting LUCENE-5770 won't help.

        Robert Muir added a comment -

        I think it might be older? I tried the 4.7 branch really quickly, and it hung too.

        Steve Rowe added a comment -

        This looks exactly like LUCENE-5400: very slow tokenization when a rule has to search to end of stream for a condition that doesn't occur.

        Robert Muir added a comment -

        OK, that makes sense. Can we do something crazy during generation of the JFlex scanner (e.g. inject a little code) to bound this to at least maxTokenLength()?

        Steve Rowe added a comment -

        I'll try to figure out a way to limit the search, as you say, to maxTokenLength(). I worry about two things though, both of which are currently handled (though badly in these adversary cases):

        1. In a rule with alternates, one of which is satisfied below the limit, the satisfied alternate should produce a match when a partially matching alternate exceeds the limit and is aborted.
        2. When rule A matches partially, exceeds the limit, and is aborted, and rule B matches a prefix that is under the limit, rule B should produce a match.
        Robert Muir added a comment -

        I agree it's not ideal.

        Can it be based on the way the rules are encoded in our grammar?

        I know, for example, that if I substitute the latest Unicode BreakIterator instead, it doesn't have the problem, but I know that has a different (typically slower) representation. But the rules (IIRC) use a chaining mechanism which is hard to think about.

        Steve Rowe added a comment -

        Can it be based on the way the rules are encoded in our grammar?

        I don't know how to do that - as I mentioned on LUCENE-5400, adding large repeat counts to sub-regexes made JFlex OOM at generation time. Were you thinking of something other than repeat counts?

        I'm thinking it should be possible to abuse JFlex's buffer handling to just never grow the buffer beyond the initial size, but still allow the contents to be shifted to enable (maximally) buffer-length matches. This would have a nice secondary effect of reducing max memory usage. If I can make it work, I'll add a generation option for this to JFlex.

        Robert Muir added a comment -

        Well, I guess one concern is the 'adversary' case, but I'm a little concerned the behavior might impact ordinary performance: so I'm just stretching a bit and trying to figure out how com.ibm.icu.text.BreakIterator (which implements the same algorithm) doesn't get hung in such an adversary case.

        I looked at http://icu-project.org/docs/papers/text_boundary_analysis_in_java/

        especially: "If the current state is an accepting state, the break position is after that character. Otherwise, the break position is after the last character that caused a transition to an accepting state. (In other words, we keep track of the break position, updating it to after the current position every time we enter an accepting state. This is called "marking" the position.)"

        So more generally, can we optimize the general case to also remove what appears to be a backtracking algorithm? I know JFlex is more general than what ICU offers, so it's like comparing apples and oranges, but I can't help but wonder...
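
        For reference, a minimal sketch of the "marking" idea quoted above, assuming a simple table-driven DFA (all names here are hypothetical, neither JFlex nor ICU code): the scanner remembers the position just after the last accepting state it entered, and when it reaches a dead state it emits a break at that marked position instead of re-scanning.

          // Hypothetical illustration of "marking": transitions[state][class] gives
          // the next state (negative = dead state), accepting[state] says whether a
          // state is accepting, charClass maps a char to its character class.
          static int nextBreak(int[][] transitions, boolean[] accepting,
                               int[] charClass, char[] text, int start) {
            int state = 0;             // start state
            int lastAccepted = start;  // "marked" break position
            for (int i = start; i < text.length; i++) {
              state = transitions[state][charClass[text[i]]];
              if (state < 0) {
                break;                 // dead state: no further match possible
              }
              if (accepting[state]) {
                lastAccepted = i + 1;  // mark: break goes after this character
              }
            }
            return lastAccepted;       // break after the last accepting state seen
          }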

        Steve Rowe added a comment -

        So more generally, can we optimize the general case to also remove what appears to be a backtracking algorithm? I know JFlex is more general than what ICU offers, so it's like comparing apples and oranges, but I can't help but wonder...

        Sorry, I don't know enough about how the automaton is constructed and run to know if this is possible.

        Steve Rowe added a comment -

        I removed the buffer expansion logic in StandardTokenizerImpl.zzRefill(), and the tokenizer still functions - as I had hoped, partial match searches are limited to the buffer size:

        @@ -509,16 +509,6 @@
               zzStartRead = 0;
             }
         
        -    /* is the buffer big enough? */
        -    if (zzCurrentPos >= zzBuffer.length - zzFinalHighSurrogate) {
        -      /* if not: blow it up */
        -      char newBuffer[] = new char[zzBuffer.length*2];
        -      System.arraycopy(zzBuffer, 0, newBuffer, 0, zzBuffer.length);
        -      zzBuffer = newBuffer;
        -      zzEndRead += zzFinalHighSurrogate;
        -      zzFinalHighSurrogate = 0;
        -    }
        -
             /* fill the buffer with new input */
             int requested = zzBuffer.length - zzEndRead;           
             int totalRead = 0;
        

        I then ran Robert's testWorthyAdversary() with the input length ranging from 100k to 3.2M chars, varying the buffer size from 4k chars (the default) down to 255, and compared against the current implementation, where unlimited buffer expansion is allowed (NBE = no buffer expansion; times are in seconds; Oracle Java 1.7.0_55; OS X 10.9.4):

        Input chars   current impl.    4k buff, NBE   2k buff, NBE   1k buff, NBE   255 buff, NBE
        100k          29s              3s             1s             <1s            <1s
        200k          136s             5s             3s             1s             <1s
        400k          547s             11s            5s             3s             1s
        800k          2,272s           22s            11s            5s             1s
        1,600k        9,000s (est.)    43s            23s            11s            3s
        3,200k        40,000s (est.)   91s            43s            22s            6s

        I didn't actually run the test against the current implementation with 1.6M and 3.2M input chars - the numbers above with (est.) after them are estimates - but for the ones I did measure, doubling the input length roughly quadruples the run time.

        By contrast, when the buffer length is limited, doubling the input length only doubles the run time.

        When the buffer length is limited, doubling the buffer length doubles the run time.

        Based on this, I'd like to introduce a new max buffer size setter to StandardTokenizer, which defaults to the initial buffer size. That way, by default buffer expansion is disabled, but can be re-enabled by setting a max buffer size larger than the initial buffer size.

        I ran luceneutil's TestAnalyzerPerf, just testing StandardAnalyzer using enwiki-20130102-lines.txt, with unpatched trunk against trunk patched to disable buffer expansion, and with a buffer size of 255 (the default max token size), 5 runs each:

                  Million tokens/sec, trunk   Million tokens/sec, patched
        run 1     7.162                       7.020
        run 2     7.079                       7.245
        run 3     7.381                       7.200
        run 4     7.352                       7.192
        run 5     7.160                       7.169
        mean      7.227                       7.166
        stddev    0.1323                      0.08589

        These are pretty noisy, but comparing the best throughput numbers, the patched version has 1.8% lower throughput.

        Based on the above, I'd also like to:

        1. set the initial buffer size to the max token length
        2. when basing the initial buffer size on the max token length, don't go above 1M or 2M chars, to guard against people specifying Integer.MAX_VALUE for the max token length (a rough sketch of this sizing rule follows the list)

        and from above:

        3. add a max buffer size setter to StandardTokenizer, which defaults to the initial buffer size.
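
        A rough sketch of the sizing rule in items 1-2 (the helper name and the exact 1M cap are assumptions for illustration, not the actual patch):

          // Hypothetical helper: derive the scanner's initial buffer size from the
          // configured max token length, with an upper bound to guard against
          // values like Integer.MAX_VALUE.
          private static final int MAX_SCANNER_BUFFER_SIZE = 1024 * 1024; // 1M chars (assumed cap)

          static int initialScannerBufferSize(int maxTokenLength) {
            if (maxTokenLength <= 0) {
              throw new IllegalArgumentException("maxTokenLength must be > 0");
            }
            return Math.min(maxTokenLength, MAX_SCANNER_BUFFER_SIZE);
          }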
        Robert Muir added a comment -

        Do we need a separate max buffer size parameter? Can it just be an impl detail based on max token length?

        Steve Rowe added a comment -

        Do we need a separate max buffer size parameter? Can it just be an impl detail based on max token length?

        It depends on whether we think anybody will want the (apparently minor) benefit of having a larger buffer, regardless of max token length.

        Steve Rowe added a comment -

        Oh, and one other side effect that people might want: when buffer size is larger than max token length, too-large tokens are not emitted, and no attempt is made to find smaller matching prefixes.

        These two seem like very minor benefits for a small audience, so I'm fine going without a separate max buffer size parameter.

        Steve Rowe added a comment -

        Trunk patch; fixes both this issue and LUCENE-5400:

        • modifies JFlex generation to disable scanner buffer expansion
        • when StandardTokenizerInterface.setMaxTokenLength() is called, the scanner's buffer size is also modified, but is limited to a max of 1M chars
        • adds randomized tests for StandardTokenizer and UAX29URLEmailTokenizer
        • I tried to find problematic text sequences for the other JFlex grammars (HTMLStripCharFilter, ClassicTokenizer, and WikipediaTokenizer), but nothing I tried worked, so I left these as-is.

        All analysis-common tests pass, as does precommit (after locally patching some javadoc problems unrelated to this issue). I'll commit to trunk and branch_4x after I've run the whole test suite.

        I'd like to include this fix in 4.10.

        ASF subversion and git services added a comment -

        Commit 1619730 from steve_rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1619730 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences.

        ASF subversion and git services added a comment -

        Commit 1619773 from steve_rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1619773 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged trunk r1619730)

        Ryan Ernst added a comment -

        Thanks for the hard work Steve! I will merge this over into the 4.10 branch.

        ASF subversion and git services added a comment -

        Commit 1619836 from Ryan Ernst in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1619836 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged branch_4x r1619773)

        ASF subversion and git services added a comment -

        Commit 1619840 from Ryan Ernst in branch 'dev/trunk'
        [ https://svn.apache.org/r1619840 ]

        LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0

        ASF subversion and git services added a comment -

        Commit 1619841 from Ryan Ernst in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1619841 ]

        LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0

        ASF subversion and git services added a comment -

        Commit 1619842 from Ryan Ernst in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1619842 ]

        LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0

        Steve Rowe added a comment -

        Thanks for including this in 4.10 and doing the backport work, Ryan Ernst.

        Michael McCandless added a comment -

        Reopen to backport to 4.9.1...

        Michael McCandless added a comment -

        Hmm ... merge conflicts on backport ... Steve Rowe, maybe you can try the backport? The conflicts seem to be only in the autogenerated sources, so maybe I just need to backport and regenerate?

        Steve Rowe added a comment -

        Sure, I'll do the backport.

        Steve Rowe added a comment -

        LUCENE-5770 (JFlex 1.6 upgrade) happened in 4.10, so backporting to 4.9 will require some changes.

        ASF subversion and git services added a comment -

        Commit 1625458 from steve_rowe in branch 'dev/branches/lucene_solr_4_9'
        [ https://svn.apache.org/r1625458 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged branch_4x r1619773)

        ASF subversion and git services added a comment -

        Commit 1625586 from steve_rowe in branch 'dev/branches/lucene_solr_4_9'
        [ https://svn.apache.org/r1625586 ]

        LUCENE-5897, LUCENE-5400: change JFlex-generated source munging so that zzRefill() doesn't call Reader.read(buffer,start,len) with len=0

        Michael McCandless added a comment -

        Bulk close for Lucene/Solr 4.9.1 release


          People

          • Assignee: Steve Rowe
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 4
