Solr / SOLR-1630

StringIndexOutOfBoundsException in SpellCheckComponent

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Labels:
      None
    • Environment:

      Solr 1.4
      Lucene 2.9.1
      Win XP
      java version "1.6.0_14"

      Description

      For some documents/search strings, the SpellCheckComponent throws StringIndexOutOfBoundsException
      See: http://www.lucidimagination.com/search/document/3be6555227e031fc/

      Replication

      It throws a StringIndexOutOfBoundsException

       String index out of range: -7
      
      java.lang.StringIndexOutOfBoundsException: String index out of range: -7
      	at java.lang.AbstractStringBuilder.replace(Unknown Source)
      	at java.lang.StringBuilder.replace(Unknown Source)
      	at org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
      	at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
      	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
      	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
      	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
      	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
      	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
      
      Attachments

      1. bug.xml (0.1 kB, Robin Wojciki)
      2. schema.xml (4 kB, Robin Wojciki)
      3. SOLR-1630.patch (5 kB, Robert Muir)
      4. SOLR-1630.patch (4 kB, Shalin Shekhar Mangar)
      5. solrconfig.xml (13 kB, Robin Wojciki)
      6. spellcheckconfig.xml (3 kB, Guillaume Lebourgeois)

        Activity

        Robert Muir added a comment -

        Grant Ingersoll wrote: "I'm fine w/ the fix, but it should be noted that we say in http://wiki.apache.org/solr/SpellCheckingAnalysis, which is linked from the http://wiki.apache.org/solr/SpellCheckComponent#Spell_Checking_Analysis page, that the analysis used for spelling should be dead simple; even stemming is not recommended."

        Thanks Grant, that was the gist of my concern with the fix. It didn't seem obvious to me what the "correct" behavior of collation should be with this sort of analysis, but this answers my question.

        At least we have the "defensive" measure in case someone configures it like this, thinks it's working for the most part, but then a user enters a hyphen.
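
As an illustration of what such a defensive measure could look like, here is a plain-Java sketch. It is not the committed patch: the `Tok` class and the `collateDefensively` method are hypothetical, and the rule used (skip any token whose range is invalid or starts before the end of the last replaced range) is only one assumption about how to avoid illegal offsets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DefensiveCollationSketch {
    /** Hypothetical stand-in for a query token; offsets refer to the original query. */
    static final class Tok {
        final int start;
        final int end;
        Tok(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Applies suggestions in iteration order, skipping any token whose range is
     * invalid or starts before the end of the previously replaced range.
     */
    static String collateDefensively(String query, Map<Tok, String> best) {
        StringBuilder collation = new StringBuilder(query);
        int offset = 0;   // net length change caused by earlier replacements
        int lastEnd = 0;  // end of the last replaced range, in original-query coordinates
        for (Map.Entry<Tok, String> entry : best.entrySet()) {
            Tok tok = entry.getKey();
            boolean invalid = tok.start < 0 || tok.start > tok.end || tok.end > query.length();
            if (invalid || tok.start < lastEnd) {
                continue; // overlapping or out-of-range token: leave the text as it is
            }
            String suggestion = entry.getValue();
            collation.replace(tok.start + offset, tok.end + offset, suggestion);
            offset += suggestion.length() - (tok.end - tok.start);
            lastEnd = tok.end;
        }
        return collation.toString();
    }

    public static void main(String[] args) {
        Map<Tok, String> best = new LinkedHashMap<>();
        best.put(new Tok(0, 7), "abd"); // "abc", reported with the whole-word range
        best.put(new Tok(0, 7), "deg"); // "def", same range, so it is skipped
        System.out.println(collateDefensively("abc_def", best)); // prints "abd"
    }
}
```

The result here is "abd": no exception is thrown, but it is arguably not a good collation, which fits the point that the "correct" behavior is not obvious with this sort of analysis.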

        Grant Ingersoll added a comment -

        Trunk: Committed revision 987509.

        3.x: Committed revision 987511.

        Grant Ingersoll added a comment -

        I'm fine w/ the fix, but it should be noted that we say in http://wiki.apache.org/solr/SpellCheckingAnalysis, which is linked from the http://wiki.apache.org/solr/SpellCheckComponent#Spell_Checking_Analysis page, that the analysis used for spelling should be dead simple; even stemming is not recommended.

        Robert Muir added a comment -

        Attached is a patch with a test case for the issue (and maybe a fix/workaround, though it still doesn't seem really right to me and I don't completely understand what this spellcheck collation should do).

        I was able to trigger it easily by using SynonymFilter + WordDelimiterFilter (wdf) at query time, which I know will muck with the offsets.

        Khaled Hammouda added a comment -

        We just hit this bug as well. To reproduce, you must index a document that contains a hyphen (or underscore) and then search with a misspelled version of the indexed text; e.g.

        document contains: mid-term
        query: mis-term
        result: exception thrown

        I looked at the code of where this is happening and it seems to be related to token offsets (of the tokenized query) in conjunction with a feature of the spellcheck component called collation. Basically collation tries to replace the original query with the top suggested words. It relies on the tokenizer to remove the original misspelled words and insert the suggested ones (using StringBuilder.replace). Unfortunately the token offsets look weird for words with hyphens (or underscore); for example:

        query: abc_def
        1st token: value = abc; startOffset = 0; endOffset = 7
        2nd token: value = def; startOffset = 0; endOffset = 7

        Because the two tokens occupy the same range (0-7) this messes up the replacement logic. I'm not sure if this tokenizer behavior is the correct one, but it's part of the problem.
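
The failure can be reproduced outside Solr with a few lines of plain Java. This is a simplified sketch, not the actual SpellCheckComponent code: the `Tok` class, the `collate` method, and the suggestions "abd" and "deg" are hypothetical. It assumes that collation walks the tokens in order, calls StringBuilder.replace for each one, and keeps a running offset to account for length changes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CollationOffsetSketch {
    /** Hypothetical stand-in for a query token; only the offsets matter here. */
    static final class Tok {
        final String text;
        final int start;
        final int end;
        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Replaces each token's range with its suggestion, shifting later ranges by a running offset. */
    static String collate(String query, Map<Tok, String> best) {
        StringBuilder collation = new StringBuilder(query);
        int offset = 0; // net length change caused by earlier replacements
        for (Map.Entry<Tok, String> entry : best.entrySet()) {
            Tok tok = entry.getKey();
            String suggestion = entry.getValue();
            collation.replace(tok.start + offset, tok.end + offset, suggestion);
            offset += suggestion.length() - (tok.end - tok.start);
        }
        return collation.toString();
    }

    public static void main(String[] args) {
        // Both tokens claim the range 0-7 of "abc_def", as in the example above.
        Map<Tok, String> best = new LinkedHashMap<>();
        best.put(new Tok("abc", 0, 7), "abd");
        best.put(new Tok("def", 0, 7), "deg");
        try {
            System.out.println(collate("abc_def", best));
        } catch (StringIndexOutOfBoundsException ex) {
            System.out.println("collation failed: " + ex.getMessage());
        }
    }
}
```

The first replacement shrinks the 7-character query to 3 characters, so the running offset becomes -4. The second token still reports start 0, so the next call uses start index -4, a negative index of the same kind as the -7 and -14 in the reported stack traces.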

        Having said that, I tried to change the spellcheck tokenizer from standard to whitespace and this actually solved the problem; no errors and I get correct suggestions.

        So, until this gets fixed you can either:

        1) Disable spellchecker collation, or
        2) Use a whitespace tokenizer for the spellchecker component
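
Why the second workaround helps can be seen with a small plain-Java sketch. It is not Lucene's whitespace tokenizer: the `whitespaceOffsets` helper is hypothetical and only mimics the idea of splitting on whitespace, so that a hyphenated or underscored word stays a single token and no two tokens share an offset range.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceOffsetsSketch {
    /** Returns a {start, end} pair for each run of non-whitespace characters. */
    static List<int[]> whitespaceOffsets(String query) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < query.length()) {
            // Skip whitespace between tokens.
            while (i < query.length() && Character.isWhitespace(query.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token; '-' and '_' are ordinary characters here.
            while (i < query.length() && !Character.isWhitespace(query.charAt(i))) {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        for (int[] o : whitespaceOffsets("sous-marin russe")) {
            System.out.println(o[0] + "-" + o[1]); // prints "0-10", then "11-16"
        }
    }
}
```

With disjoint, increasing ranges like these, replacing one token can only shift the tokens after it, so position-based replacement stays within bounds.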

        Graham P added a comment - edited

        We get the same StringIndexOutOfBoundsException with hyphenated query strings, on production and development - Solr 1.4.0 on CentOS 5.4 x86_64 using 32-bit Java 1.6.0_12

        Jay Hill added a comment -

        I have seen another case of a production system hitting this exact same exception. I'm unable to reproduce it outside of production, but there it occurs on all queries with hyphenated words. For example, for a search on: ochoa-brillembourg

        SEVERE: java.lang.StringIndexOutOfBoundsException: String index out of range: -14
        at java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:797)
        at java.lang.StringBuilder.replace(StringBuilder.java:271)
        at org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
        at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
        at java.lang.Thread.run(Thread.java:619)

        Ralf Kraus added a comment -

        We have found a hint to the problem:

        We run into this problem ONLY when the search result includes words with "-" in them!
        For example "t-bone-steak".

        When I search with "t-bone-steak" the error occurs every time!

        I hope this helps!

        Guillaume Lebourgeois added a comment -

        I've been trying to reproduce the bug with a "one-document index", but without success. On the other hand, on an index of 500k+ documents the issue occurs every time. Maybe it's linked to certain kinds of documents? I don't know; I'm going to test some other possibilities in case it helps.

        Shalin Shekhar Mangar added a comment -

        Thanks Guillaume, can you give me an example document too?

        Guillaume Lebourgeois added a comment - edited

        This file provides a spellcheck configuration and a request handler which may raise an exception when making queries.

        Examples of queries which work fine:

        • ?q=test
        • ?q=my+name+is+henry
        • ?q=éléphant

        Examples of queries which throw an exception:

        • ?q=sous-marin
        • ?q=sous-marin+russe
        • ?q=sous_marin
        • ?q=éléphant-blanc

        It may be linked to the content of the index, and/or the spellcheck index.

        Here is the stack trace:

        at java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:797)
        at java.lang.StringBuilder.replace(StringBuilder.java:271)
        at org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
        at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

        Guillaume Lebourgeois added a comment -

        OK, I'm going to try to upload my own config in case it helps.

        Shalin Shekhar Mangar added a comment -

        I'm not able to reproduce this issue. I used Robin's document, schema and solrconfig.xml in the form of a unit test and it gives an empty spell check response but no exceptions.

        Guillaume Lebourgeois added a comment -

        Hi, it seems I've encountered the same bug. All queries containing the - or _ character make Solr throw an exception when using the SpellCheckComponent. It is possible to work around it temporarily by setting the accuracy parameter to 1.0 (which makes the spellcheck pretty useless, but avoids the exceptions).

        Robin Wojciki added a comment -

        Solr config and Solr doc for replicating the issue


          People

          • Assignee:
            Shalin Shekhar Mangar
            Reporter:
            Robin Wojciki
          • Votes:
            2
            Watchers:
            5
