Lucene - Core
LUCENE-5400

Long text matching email local-part rule in UAX29URLEmailTokenizer causes extremely slow tokenization

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.5
    • Fix Version/s: 4.9.1, 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a pretty nasty bug, and causes the cluster to stop accepting updates. I'm not sure how to consistently reproduce it but I have done so numerous times. Switching to a whitespace tokenizer improved indexing speed, and I never got the issue again.

      I'm running a 4.6 Snapshot - I had issues with deadlocks with numerous versions of Solr, and have finally narrowed down the problem to this code, which affects many/all versions of Solr.

      When a thread hits this issue it uses 100% CPU; restarting the node that has the error allows indexing to continue until the problem is hit again. Here is the thread dump:

      http-bio-8080-exec-45 (201)

      org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
      org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
      org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
      org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
      org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
      org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
      org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
      org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
      org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
      org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
      org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
      org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
      org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
      org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
      org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
      org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
      org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
      org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
      org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
      org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
      org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
      org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
      org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
      org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
      org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
      org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
      org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
      org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
      org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
      org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
      org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
      org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
      org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
      org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
      org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
      org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
      org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
      org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
      org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
      org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
      org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
      org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
      org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
      java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      java.lang.Thread.run(Unknown Source)

        Activity

        Chris Geeringh added a comment -

        Googling, I found someone who hit the same issue with Elasticsearch: https://gist.github.com/jeremy/2925923

        Hoss Man added a comment -

        Ouch!

        Chris: You mentioned "deadlocks" but your stack trace doesn't have enough detail to be clear if this is truly a deadlocked thread situation or not – if so, your thread dump should show the locks in contention.

        I'm not very familiar with this code, but my gut guess based on your description (particularly the 100% CPU) is that there isn't actually a deadlock, but that some input causes the tokenizer to go into an infinite loop.

        I don't suppose you could post any example input that you are feeding to this Tokenizer that causes the problem? (I'm guessing not since it's likely to be a big pile of email addresses that shouldn't be posted publicly)

        Hoss Man added a comment -

        I don't suppose you could post any example input that you are feeding to this Tokenizer that causes the problem? (I'm guessing not since it's likely to be a big pile of email addresses that shouldn't be posted publicly)

        FWIW: If you have a manageable chunk of sample data that semi-consistently reproduces the problem – but you can't share it publicly on the open internet, please let us know anyway: if you could privately share it with a couple of the devs (I'm thinking rmuir & sarowe) they might be able to figure out the problem and create new test cases w/o using your actual data.

        Robert Muir added a comment -

        And either way, some additional information could help:

        • configuration (what flags are being passed to the tokenizer)
        • approximate size of documents (1KB or 22MB or whatever)
        • JVM version
        Chris Geeringh added a comment -

        I've been busy with an international relocation, but I'm back up and running and have gotten back to this.

        The infinite loop is hit within the loop on line 4305. I have found the offending text, and it is not email addresses, but rather the source code of an HTML page which has been URL-encoded. It should be relatively easy to reproduce (URL-encode this page's source code, for example). If you need the exact text I am using, I can provide it privately.

        As a stop-gap, since this text would never be searched, I'm detecting it and not pushing it up to Solr.
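
        A minimal sketch of that kind of detection, for illustration only (the regular expression, threshold, and class name are assumptions, not the actual stop-gap code):

        import java.util.regex.Pattern;

        /** Heuristic pre-filter: flags field values dominated by a long run of
         *  URL-encoded bytes so they can be skipped before being sent to Solr. */
        public final class UrlEncodedBlobDetector {

          // 1000+ consecutive %XX escapes; the threshold is an arbitrary choice.
          private static final Pattern LONG_ENCODED_RUN =
              Pattern.compile("(?:%[0-9A-Fa-f]{2}){1000,}");

          public static boolean looksLikeEncodedBlob(String fieldValue) {
            return fieldValue != null && LONG_ENCODED_RUN.matcher(fieldValue).find();
          }
        }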

        To answer the questions above: I'm running on Linux, JVM version 7 update 25, docs range in size from 10KB to 4MB, and I'm not passing any flags to the tokenizer.

        Steve Rowe added a comment -

        Chris Geeringh privately sent me a document that triggers this problem. The document consists of an HTML snippet containing a <script> block, which contains a 3-megabyte-long URL-encoded string in single-quotes, given as a parameter to a javascript function defined elsewhere. (The purpose of the javascript function is to URL-decode the string.)

        When I run this text through UAX29URLEmailTokenizer, it doesn't actually hang - it just tokenizes extremely slowly, consuming less than 100 characters per second on my laptop. I didn't wait long enough to find out, but I estimate the average scan rate over the entire text is on the order of 200 characters per second, so it would probably take about 4 hours to finish. (I also sent the same text through StandardTokenizer, which fortunately does not exhibit the slow tokenization behavior.) To convince myself that this is not an endless loop of some kind, I ran shorter runs (hundreds of chars) of URL-encoded text through UAX29URLEmailTokenizer, and they successfully finished.

        I guessed that the problem was with email addresses, so I commented out that part of the UAX29URLEmailTokenizer specification, and that caused the text to be scanned at the same speed as StandardTokenizer.

        The email rule in UAX29URLEmailTokenizer is basically the sequence <local-part>, "@", <domain>. What's happening is that the entire 3-MB-long URL-encoded string matches <local-part> (the stuff before the "@" in an email address), so for each "%XX" URL-encoded byte, the scanner scans through most of the remaining text looking for a "@" character, then gives up when it reaches the end of the URL-encoded string without finding one, and finally falls back to tokenizing "XX" as <ALPHANUM>. The scanner then starts again trying to match an email address over the remainder of the URL-encoded string, and so on. So it's not much of a surprise that this is slow.
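
        For reference, a self-contained sketch of the kind of input that triggers this slow path (a hedged illustration, not a test from the Lucene source tree; the constructor shown is the 4.x-era API that takes a Reader, and the loop count is an arbitrary choice):

        import java.io.StringReader;
        import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

        public class SlowLocalPartRepro {
          public static void main(String[] args) throws Exception {
            // A long run of URL-encoded bytes with no '@': every position partially
            // matches the email <local-part> rule, so the scanner repeatedly rescans
            // most of the remaining text before giving up. On affected versions the
            // runtime grows roughly quadratically with the length of this run.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
              sb.append('%').append(String.format("%02X", 0x41 + (i % 26)));
            }

            // Lucene 4.10-era constructor taking a Reader; on 5.x+ you would use the
            // no-arg constructor followed by setReader().
            UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer(new StringReader(sb.toString()));
            long start = System.nanoTime();
            int tokens = 0;
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
              tokens++;
            }
            tokenizer.end();
            tokenizer.close();
            System.out.printf("%d tokens in %.1f seconds%n", tokens, (System.nanoTime() - start) / 1e9);
          }
        }

        Running the same input through StandardTokenizer should be dramatically faster, matching the observation above.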

        RFC5321 says:

        4.5.3.1.  Size Limits and Minimums
        
           There are several objects that have required minimum/maximum sizes.
           Every implementation MUST be able to receive objects of at least
           these sizes.  Objects larger than these sizes SHOULD be avoided when
           possible.  However, some Internet mail constructs such as encoded
           X.400 addresses (RFC 2156 [35]) will often require larger objects.
           Clients MAY attempt to transmit these, but MUST be prepared for a
           server to reject them if they cannot be handled by it.  To the
           maximum extent possible, implementation techniques that impose no
           limits on the length of these objects should be used.
        
           Extensions to SMTP may involve the use of characters that occupy more
           than a single octet each.  This section therefore specifies lengths
           in octets where absolute lengths, rather than character counts, are
           intended.
        
        4.5.3.1.1.  Local-part
        
           The maximum total length of a user name or other local-part is 64
           octets.
        

        So local-parts of email addresses that are going to work everywhere are effectively limited to 64 bytes. (Section 3 of RFC3696 says the same thing.)

        One possible solution to this problem is to limit the allowable length of the local-part. Currently the rule looks like:

        EMAILquotedString = [\"] ([\u0001-\u0008\u000B\u000C\u000E-\u0021\u0023-\u005B\u005D-\u007E] | [\\] [\u0000-\u007F])* [\"]
        EMAILatomText = [A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]
        EMAILlabel = {EMAILatomText}+ | {EMAILquotedString}
        EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
        

        When I try to limit EMAILlabel as follows, JFlex takes forever (minutes) trying to generate the scanner, but then eventually OOMs, even with env. var. ANT_OPT=-Xmx2g (I didn't try larger):

        EMAILlabel = {EMAILatomText}{1,64} | {EMAILquotedString}
        

        (Note that EMAILquotedString has the same unlimited length problem - really long quoted ASCII strings could result in the same extremely slow tokenization behavior.)

        I think a solution could include a rule matching a fixed-length longer-than-maximum local-part, the action for which sets a lexical state where email addresses aren't allowed, and then pushes back the matched text onto the input stream. I haven't figured out exactly how to do this yet, though.

        I'd welcome other ideas.

        Uwe Schindler added a comment -

        This is a Lucene issue, we should move it to Lucene.

        Chris Geeringh added a comment -

        No doubt there is a Lucene issue here.

        I would have thought that, as this can put an entire SolrCloud out of action, there is room for this being a Solr architecture issue too.

        Steve Rowe added a comment -

        This is a Lucene issue, we should move it to Lucene.

        I agree, I'll move it and update the issue title to reflect this, and also the fact that it's not hanging, but rather tokenizing very slowly.

        Edu Garcia added a comment -

        Hi.

        We've hit this bug in Atlassian Confluence (https://jira.atlassian.com/browse/CONF-32566) and it's causing a bit of customer pain.

        Is Steve Rowe's solution a viable one, or is someone working on a better one?

        Thank you!

        ASF subversion and git services added a comment -

        Commit 1619730 from steve_rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1619730 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences.

        ASF subversion and git services added a comment -

        Commit 1619773 from steve_rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1619773 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged trunk r1619730)

        ASF subversion and git services added a comment -

        Commit 1619836 from Ryan Ernst in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1619836 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged branch_4x r1619773)

        ASF subversion and git services added a comment -

        Commit 1619840 from Ryan Ernst in branch 'dev/trunk'
        [ https://svn.apache.org/r1619840 ]

        LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0

        ASF subversion and git services added a comment -

        Commit 1619841 from Ryan Ernst in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1619841 ]

        LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0

        ASF subversion and git services added a comment -

        Commit 1619842 from Ryan Ernst in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1619842 ]

        LUCENE-5672,LUCENE-5897,LUCENE-5400: move changes entry to 4.10.0

        ASF subversion and git services added a comment -

        Commit 1625458 from steve_rowe in branch 'dev/branches/lucene_solr_4_9'
        [ https://svn.apache.org/r1625458 ]

        LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of text partially matching certain grammar rules. The scanner default buffer size was reduced, and scanner buffer growth was disabled, resulting in much, much faster tokenization for these text sequences. (merged branch_4x r1619773)

        Michael McCandless added a comment -

        Thanks for backporting, Steve Rowe!

        Hmm, now I'm hitting this test failure on 4.9.x:

        ant test  -Dtestcase=TestStandardAnalyzer -Dtests.method=testRandomHugeStringsGraphAfter -Dtests.seed=65FB3AF41D805AF9 -Dtests.locale=mk_MK -Dtests.timezone=Etc/GMT+5 -Dtests.file.encoding=UTF-8
        
           [junit4] FAILURE 0.41s | TestStandardAnalyzer.testRandomHugeStringsGraphAfter <<<
           [junit4]    > Throwable #1: java.lang.AssertionError
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([65FB3AF41D805AF9:CA1B98C5DDF4A2CB]:0)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:751)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:614)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:513)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:437)
           [junit4]    > 	at org.apache.lucene.analysis.core.TestStandardAnalyzer.testRandomHugeStringsGraphAfter(TestStandardAnalyzer.java:402)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
           [junit4]   2> NOTE: test params are: codec=Lucene46, sim=RandomSimilarityProvider(queryNorm=false,coord=no): {}, locale=mk_MK, timezone=Etc/GMT+5
           [junit4]   2> NOTE: Linux 3.13.0-32-generic amd64/Oracle Corporation 1.7.0_55 (64-bit)/cpus=8,threads=1,free=378278472,total=503316480
           [junit4]   2> NOTE: All tests run in this JVM: [TestStandardAnalyzer]
        

        I dug just a bit... looks like we are passing len=0 to MockReaderWrapper.read(char[], int, int), which it can't handle (it calls realLen = TestUtil.nextInt(random, 1, len);) ... I'm not sure why we don't hit this on 4.x/trunk...

        Michael McCandless added a comment -

        I think this "passing len=0" was fixed in 4.x/trunk by one of the JFlex upgrades? When I diff StandardTokenizerImpl.java from 4.9.x to 4.x I see this difference:

        1025,1027c523,532
        <     /* finally: fill the buffer with new input */
        <     int numRead = zzReader.read(zzBuffer, zzEndRead,
        <                                             zzBuffer.length-zzEndRead);
        ---
        >     /* fill the buffer with new input */
        >     int requested = zzBuffer.length - zzEndRead - zzFinalHighSurrogate;           
        >     int totalRead = 0;
        >     while (totalRead < requested) {
        >       int numRead = zzReader.read(zzBuffer, zzEndRead + totalRead, requested - totalRead);
        >       if (numRead == -1) {
        >         break;
        >       }
        >       totalRead += numRead;
        >     }
        

        I could "fix" this by having MockReaderWrapper.read immediately return 0 if len is 0, but this seems scary .... i.e. is there a real bug in StandardTokenizerImpl...

        Steve Rowe added a comment -

        Thanks for finding the bug, Michael McCandless.

        This problem doesn't exist on trunk or branch_4x because JFlex 1.6's zzRefill() doesn't call Reader.read() with len=0. It's only a problem on lucene_solr_4_9 because when I adjusted the generated scanner munging in analysis-common's run-jflex-and-disable-buffer-expansion macro to work with JFlex 1.5-generated code for the 4.9.1 backport, I didn't also modify the code to not call Reader.read() with len=0.

        I've changed the munging code locally and TestStandardAnalyzer.testRandomHugeStringsGraphAfter() now passes with the above-mentioned seed. Here's what StandardTokenizerImpl.zzRefill() has now:

        /* finally: fill the buffer with new input */
        int numRead = 0, requested = zzBuffer.length - zzEndRead;
        if (requested > 0) numRead = zzReader.read(zzBuffer, zzEndRead, requested);
        

        I'm currently beasting TestStandardAnalyzer and TestUAX29URLEmailTokenizer (no failures yet after 100 and 50 runs, respectively).

        Committing the fix shortly.

        ASF subversion and git services added a comment -

        Commit 1625586 from steve_rowe in branch 'dev/branches/lucene_solr_4_9'
        [ https://svn.apache.org/r1625586 ]

        LUCENE-5897, LUCENE-5400: change JFlex-generated source munging so that zzRefill() doesn't call Reader.read(buffer,start,len) with len=0

        Steve Rowe added a comment -

        Committed the backport fix to lucene_solr_4_9.

        Uwe Schindler added a comment -

        This fix is fine, because it spares one method call.

        But in any case the MockReader impl is wrong. You can always call Reader.read() with len=0; this is not disallowed, and all other readers support this. So MockReader may just need a condition like if (len==0) return 0;
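
        For illustration, a minimal sketch of a randomizing reader wrapper with that guard (this is just the shape of the pattern, not the actual MockReaderWrapper from Lucene's test framework):

        import java.io.IOException;
        import java.io.Reader;
        import java.util.Random;

        /** Illustrative wrapper that, like MockReaderWrapper, returns a random number
         *  of chars per call to shake out buffer-handling bugs in tokenizers. */
        public class RandomChunkReader extends Reader {
          private final Reader in;
          private final Random random;

          public RandomChunkReader(Reader in, Random random) {
            this.in = in;
            this.random = random;
          }

          @Override
          public int read(char[] cbuf, int off, int len) throws IOException {
            if (len == 0) {
              return 0; // len == 0 is legal per Reader.read(); avoid picking a length in [1, 0]
            }
            int realLen = 1 + random.nextInt(len); // read between 1 and len chars
            return in.read(cbuf, off, realLen);
          }

          @Override
          public void close() throws IOException {
            in.close();
          }
        }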

        Michael McCandless added a comment -

        +1 to fix MockReaderWrapper.

        Michael McCandless added a comment -

        But then again I sort of want to know when a Lucene tokenizer is passing len=0 ... that's ... a strange thing to be doing.

        Uwe Schindler added a comment -

        Yeah, so we have to decide:

        • if MockReaderWrapper should be standards-conformant, or
        • if we want to detect bugs.
          For the latter we should keep it as it is. Maybe make it explicit and print a good message in the assert.
        Hide
        Michael McCandless added a comment -

        +1 to make MRW "anal" and throw an exc on len==0 explaining that it's actually OK but WTF is your tokenizer doing...

        Hide
        Michael McCandless added a comment -

        Bulk close for Lucene/Solr 4.9.1 release


          People

          • Assignee:
            Steve Rowe
            Reporter:
            Chris Geeringh
          • Votes:
            1
            Watchers:
            6
