Solr / SOLR-1398

PatternTokenizerFactory ignores offset corrections

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      I have an analyzer with a MappingCharFilterFactory followed by a PatternTokenizerFactory. This causes wrong offsets, and thus wrong highlights.

      Replacing the tokenizer with WhitespaceTokenizerFactory gives correct offsets, so I expect the problem to be with PatternTokenizerFactory.

      1. SOLR-1398.patch
        4 kB
        Koji Sekiguchi
      2. SOLR-1398.patch
        6 kB
        Koji Sekiguchi

        Activity

        Koji Sekiguchi added a comment -

        Anders, thank you for reporting the problem. Can you show a concrete case so I can reproduce the problem?

        Anders Melchiorsen added a comment -

        I used this slightly modified configuration from the example config:

        <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
            <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;/\s]+"/>
          </analyzer>
        </fieldType>

        with the file mapping.txt containing just:

        "& uuml;" => "ü"

        and analyzing the string "G& uuml;nther G& uuml;nther is here" with analysis.jsp (with verbose output) gives offsets:

        5,12 13,20 21,23 24,28

        while they should be:

        0,12 13,25 26,28 29,33

        (Note, I had to split the HTML entity into two parts to have it display in JIRA)
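
The expected offsets above can be checked with a small standalone sketch. It is not Solr or Lucene code: `apply_mapping`, `correct_offset` and `pattern_tokenize` are hypothetical helpers that only imitate the idea of a char filter recording cumulative length differences so that token offsets can be mapped back to the original text. The entity is written unsplit here, and the sketch does not try to reproduce the incorrect offsets reported above.

```python
import re
from bisect import bisect_right


def apply_mapping(text, mapping):
    """Replace mapped strings and record offset corrections.

    Returns (filtered_text, corrections), where corrections is a list of
    (offset_in_filtered_text, cumulative_length_difference) pairs.
    """
    out, corrections = [], []
    out_len = diff = i = 0
    while i < len(text):
        for src, dst in mapping.items():
            if text.startswith(src, i):
                out.append(dst)
                out_len += len(dst)
                i += len(src)
                diff += len(src) - len(dst)
                corrections.append((out_len, diff))
                break
        else:
            # No mapping matched at position i: copy one character through.
            out.append(text[i])
            out_len += 1
            i += 1
    return "".join(out), corrections


def correct_offset(offset, corrections):
    """Map an offset in the filtered text back to the original text."""
    idx = bisect_right([pos for pos, _ in corrections], offset) - 1
    return offset if idx < 0 else offset + corrections[idx][1]


def pattern_tokenize(text, pattern):
    """Split text on a delimiter regex; return (term, start, end) tuples."""
    tokens, start = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > start:
            tokens.append((text[start:m.start()], start, m.start()))
        start = m.end()
    if start < len(text):
        tokens.append((text[start:], start, len(text)))
    return tokens


original = "G&uuml;nther G&uuml;nther is here"
filtered, corrections = apply_mapping(original, {"&uuml;": "ü"})
raw = pattern_tokenize(filtered, r"[,;/\s]+")
corrected = [(correct_offset(s, corrections), correct_offset(e, corrections))
             for _, s, e in raw]
print(corrected)  # [(0, 12), (13, 25), (26, 28), (29, 33)]
```

Without the correction step the tokenizer would report offsets relative to the filtered text (0,7 8,15 16,18 19,23), which no longer line up with the original string.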

        Koji Sekiguchi added a comment -

        Anders, can you apply the patch and see the highlighted result?

        Anders Melchiorsen added a comment -

        Thanks. The patch appears to work, in that analysis.jsp now gives correct results. However, I am still not able to get highlights in my actual application, due to the error below. The same problem occurs with the WhitespaceTokenizer.

        I guess that this is a separate issue, where the highlighter is also not using offset corrections. Would you mind opening a ticket for that issue? I am not quite sure what to put into it, or where to put it.

        SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token teknisk exceeds length of provided text sized 803
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:328)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
        Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token teknisk exceeds length of provided text sized 803
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:254)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:321)
        ... 23 more
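
Going only by the wording of the exception message, the highlighter seems to reject any token whose offsets point past the end of the text it was given. A minimal sketch of such a check, with hypothetical names (`InvalidTokenOffsets`, `check_token_offsets`) that are not the Lucene API:

```python
class InvalidTokenOffsets(Exception):
    """Stand-in for the exception named in the stack trace above."""


def check_token_offsets(text, tokens):
    """Raise if any (term, start, end) token points past the end of text."""
    for term, start, end in tokens:
        if start > len(text) or end > len(text):
            raise InvalidTokenOffsets(
                f"Token {term} exceeds length of provided text sized {len(text)}"
            )
```

Under that assumption, offsets computed against one version of the text and applied to a shorter one would trigger the error.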

        Koji Sekiguchi added a comment -

        Anders, thank you for testing the patch and reporting the result. Yes, I think the error is a separate issue. Can you show the procedure (schema.xml, indexed data and request parameters) to reproduce the error? I tried to index "G& uuml;nther G& uuml;nther is here" and search "Günther", but I could get a highlighted result successfully.

        Anders Melchiorsen added a comment -

        Koji, let us not mix up things. I will create a new ticket for that error once I figure out how to reproduce it reliably.

        Anders Melchiorsen added a comment -

        I created SOLR-1404 for the above error. From my point of view, the PatternTokenizerFactory issue that the present ticket is about is resolved with the patch from Koji.

        Koji Sekiguchi added a comment -

        A new patch with a test case. Will commit shortly.

        Koji Sekiguchi added a comment -

        Committed revision 811753.

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Anders Melchiorsen