Solr / SOLR-331

StringIndexOutOfBoundsException when using synonyms and highlighting

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2
    • Fix Version/s: 1.2
    • Component/s: search
    • Labels:
      None
    • Environment:

      JBOSS 4.0.5.GA
      SOLR 1.2 binary release

      Description

      When searching the index with highlighting and synonyms enabled, we get the following exception:

      java.lang.StringIndexOutOfBoundsException: String index out of range: 42
      at java.lang.String.substring(String.java:1935)
      at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:271)
      at org.apache.solr.util.HighlightingUtils.doHighlighting(HighlightingUtils.java:266)
      at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:164)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:595)
      at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:697)
      at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
      at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
      at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:175)
      at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:74)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
      at org.jboss.web.tomcat.tc5.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:156)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
      at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
      at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
      at org.apache.tomcat.util.net.MasterSlaveWorkerThread.run(MasterSlaveWorkerThread.java:112)
      at java.lang.Thread.run(Thread.java:619)

      The problem is reproducible every time with the files attached to this issue. Use the
      "Make a Query [Full Interface]" option in the Solr admin interface and search for:
      Statement: adhs
      Enable Highlighting: X
      Fields to Highlight: content

      e.g.
      http://127.0.0.1:8080/solr/select?indent=on&version=2.2&q=adhs&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=content

      Thank you in advance!

      Oliver

      1. adhs_small.xml
        0.2 kB
        Oliver Kuhn
      2. schema.xml
        13 kB
        Oliver Kuhn
      3. synonyms.txt
        0.7 kB
        Oliver Kuhn
      4. WordDelimiterFilter.patch
        3 kB
        Yonik Seeley

        Activity

        Oliver Kuhn added a comment -

        To create the index with our data, we used:

        curl http://127.0.0.1:8080/solr/update --data-binary @adhs_small.xml -H "Content-type:text/xml; charset=utf-8"
        curl http://127.0.0.1:8080/solr/update --data-binary "<commit />"

        adhs_small.xml contains our content.

        Yonik Seeley added a comment -

        Thanks for the very clear report.
        It looks like WordDelimiterFilter is messing up the offsets.

        http://localhost:8983/solr/admin/analysis.jsp?nt=name&name=content&highlight=on&val=&qverbose=on&qval=dummy+dummy+dummy+ADHS+dummy

        As a side note, you probably don't want to be expanding the same synonym list at both index and query time. And expanding multi-word synonyms (of differing numbers of tokens) at query time doesn't really work... see the wiki for details on that.

        Yonik Seeley added a comment -

        OK, the issue is the offsets for subwords. For example, WordDelimiterFilter is looking at the synonym token
        Aufmerksamkeits-Defizite, and it doesn't know that this token is not part of the original text, so when it creates the sub-token
        Aufmerksamkeits, it sets the offsets to match the length of "Aufmerksamkeits" (which is longer than the original text).
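
To make the failure mode concrete, here is a minimal sketch in plain Java. It uses no Lucene or Solr classes. The sample text, the offsets, the synonym mapping, and the names `OffsetOverflowDemo` and `naiveSubwordEnd` are all assumptions made up for illustration, not taken from the attached files. It imitates a synonym token that inherits the offsets of the short original word, followed by a naive subword split that derives the end offset from the length of the longer synonym text.

```java
public class OffsetOverflowDemo {
    /** Naive subword end offset: parent start + position in token + subword length. */
    static int naiveSubwordEnd(int parentStart, int subwordStartInToken, String subword) {
        return parentStart + subwordStartInToken + subword.length();
    }

    public static void main(String[] args) {
        String original = "dummy ADHS";               // hypothetical indexed text
        int start = 6;                                // start offset of "ADHS" in the text
        // Hypothetical synonym injected for "ADHS"; it keeps the offsets of "ADHS".
        String synonym = "Aufmerksamkeits-Defizite";
        // First subword produced by splitting the synonym on the hyphen.
        String subword = synonym.substring(0, synonym.indexOf('-'));

        // The end offset is computed from the subword length, so it can
        // point past the end of the original text.
        int badEnd = naiveSubwordEnd(start, 0, subword);
        System.out.println("end offset " + badEnd + ", text length " + original.length());

        try {
            // A highlighter cuts fragments out of the original text by offset.
            original.substring(start, badEnd);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("highlighting would fail: " + e.getMessage());
        }
    }
}
```

The exception is raised only when the inflated offset lands beyond the end of the stored text; otherwise the symptom would be a misplaced highlight rather than an error.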

        Some ideas to fix:

        • Make sure that generated offsets never exceed the original offsets (we should always do this). That will eliminate the exception, but will produce highlighting mismatches in some cases.
        • Try to recognize non-original tokens such as synonyms:
          1) by token type? But not everyone uses that.
          2) by seeing that the offsets don't match the token length. Stemming would also cause this, but stemming should always come after WordDelimiterFilter. Is there anything else that would cause "length != end-start"?
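
The two ideas above can be sketched in plain Java as follows. This is an illustration only and not the code in WordDelimiterFilter.patch. The class `SubwordOffsets` and its methods are hypothetical names, and tokens are modelled as a string plus two integer offsets instead of Lucene token objects.

```java
public class SubwordOffsets {
    /** Idea 1: keep a generated offset inside the parent token's [start, end] range. */
    static int clamp(int offset, int parentStart, int parentEnd) {
        return Math.max(parentStart, Math.min(offset, parentEnd));
    }

    /**
     * Idea 2, option 2: a token whose text length differs from its offset span
     * was probably not copied verbatim from the original text (for example a synonym).
     */
    static boolean looksOriginal(String tokenText, int start, int end) {
        return tokenText.length() == end - start;
    }

    /** Returns {start, end} offsets for the subword tokenText[subStart, subEnd). */
    static int[] subwordOffsets(String tokenText, int parentStart, int parentEnd,
                                int subStart, int subEnd) {
        if (!looksOriginal(tokenText, parentStart, parentEnd)) {
            // Positions inside the token cannot be mapped back to the original
            // text, so the subword reuses the parent's offsets unchanged.
            return new int[] { parentStart, parentEnd };
        }
        return new int[] {
            clamp(parentStart + subStart, parentStart, parentEnd),
            clamp(parentStart + subEnd, parentStart, parentEnd)
        };
    }

    public static void main(String[] args) {
        // Hypothetical synonym token carrying the offsets of a 4-character word.
        int[] syn = subwordOffsets("Aufmerksamkeits-Defizite", 6, 10, 0, 15);
        // Hypothetical original token "Wi-Fi" at offsets 0..5, subword "Fi".
        int[] orig = subwordOffsets("Wi-Fi", 0, 5, 3, 5);
        System.out.println(syn[0] + ".." + syn[1]);
        System.out.println(orig[0] + ".." + orig[1]);
    }
}
```

With this rule a subword of a synonym highlights the whole original word, which is coarse but always a valid range in the stored text.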
        Yonik Seeley added a comment -

        This patch seems to work fine: it does not adjust the offsets when they don't match the original token length.

        Yonik Seeley added a comment -

        Committed.

        Oliver Kuhn added a comment -

        Hi Yonik, thanks for the quick answer. We will try this patch in the near future!


          People

          • Assignee:
            Unassigned
            Reporter:
            Oliver Kuhn
          • Votes:
            0
            Watchers:
            2
