Solr / SOLR-1398

PatternTokenizerFactory ignores offset corrections

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      I have an analyzer with a MappingCharFilterFactory followed by a PatternTokenizerFactory. This causes wrong offsets, and thus wrong highlights.

      Replacing the tokenizer with WhitespaceTokenizerFactory gives correct offsets, so I expect the problem to be with PatternTokenizerFactory.

      1. SOLR-1398.patch
        6 kB
        Koji Sekiguchi
      2. SOLR-1398.patch
        4 kB
        Koji Sekiguchi

        Activity

        koji Koji Sekiguchi added a comment -

        Anders, thank you for reporting the problem. Can you show a concrete case so I can reproduce the problem?

        andersm Anders Melchiorsen added a comment -

        I used this slightly modified configuration from the example config:

        <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
            <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;/\s]+"/>
          </analyzer>
        </fieldType>

        with the file mapping.txt containing just:

        "& uuml;" => "ü"

        and analyzing the string "G& uuml;nther G& uuml;nther is here" with analysis.jsp (with verbose output) gives offsets:

        5,12 13,20 21,23 24,28

        while they should be:

        0,12 13,25 26,28 29,33

        (Note, I had to split the HTML entity into two parts to have it display in JIRA)
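
The expected offsets above can be checked with a small standalone sketch. It is not Lucene's or Solr's implementation: `apply_mapping` and `pattern_tokenize` are hypothetical helpers that only imitate what a mapping char filter and a pattern tokenizer do, and the input uses the unsplit entity `&uuml;` rather than the split form shown in this comment. The point it illustrates is that the tokenizer sees the filtered text, so its raw offsets must be mapped back to the original text before they are used for highlighting.

```python
import re


def apply_mapping(text, src, dst):
    """Replace every occurrence of src with dst.

    Returns the filtered text plus a list where offsets[i] is the position
    in the original text of filtered character i. A final sentinel entry
    maps the end of the filtered text to the end of the original text.
    (Hypothetical helper, not the MappingCharFilter algorithm.)
    """
    out = []
    offsets = []
    i = 0
    while i < len(text):
        if text.startswith(src, i):
            out.append(dst)
            offsets.extend([i] * len(dst))
            i += len(src)
        else:
            out.append(text[i])
            offsets.append(i)
            i += 1
    offsets.append(len(text))
    return "".join(out), offsets


def pattern_tokenize(text, pattern):
    """Split text on the delimiter pattern, yielding (start, end) per token."""
    tokens = []
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            tokens.append((pos, match.start()))
        pos = match.end()
    if pos < len(text):
        tokens.append((pos, len(text)))
    return tokens


original = "G&uuml;nther G&uuml;nther is here"
filtered, offsets = apply_mapping(original, "&uuml;", "ü")

# Offsets relative to the filtered text, as the tokenizer sees it.
raw = pattern_tokenize(filtered, r"[,;/\s]+")

# Offsets mapped back to the original text, as a highlighter needs them.
corrected = [(offsets[start], offsets[end]) for start, end in raw]

print(raw)        # [(0, 7), (8, 15), (16, 18), (19, 23)]
print(corrected)  # [(0, 12), (13, 25), (26, 28), (29, 33)]
```

The corrected pairs match the offsets listed above as the expected result. The sketch does not try to reproduce the wrong offsets that were reported.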

        koji Koji Sekiguchi added a comment -

        Anders, can you apply the patch and see the highlighted result?

        andersm Anders Melchiorsen added a comment -

        Thanks. The patch appears to work, in that analysis.jsp now gives correct results. However, I am still not able to get highlights in my actual application, because of the error below. The same problem occurs with the WhitespaceTokenizer.

        I guess that this is a separate issue, where the highlighter is also not using offset corrections. Would you mind opening a ticket for that issue, as I am not quite sure what to put into it or where to put it?

        SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token teknisk exceeds length of provided text sized 803
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:328)
        at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
        Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token teknisk exceeds length of provided text sized 803
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:254)
        at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:321)
        ... 23 more

        koji Koji Sekiguchi added a comment -

        Anders, thank you for testing the patch and reporting the result. Yes, I think the error is a separate issue. Can you show the procedure (schema.xml, indexed data and request parameters) to reproduce the error? I tried indexing "G& uuml;nther G& uuml;nther is here" and searching for "Günther", and I got a highlighted result successfully.

        andersm Anders Melchiorsen added a comment -

        Koji, let us not mix up things. I will create a new ticket for that error once I figure out how to reproduce it reliably.

        andersm Anders Melchiorsen added a comment -

        I created SOLR-1404 for the above error. From my point of view, the PatternTokenizerFactory issue that the present ticket is about is resolved with the patch from Koji.

        koji Koji Sekiguchi added a comment -

        A new patch with a test case. I will commit shortly.

        koji Koji Sekiguchi added a comment -

        Committed revision 811753.

        gsingers Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee:
            koji Koji Sekiguchi
            Reporter:
            andersm Anders Melchiorsen
          • Votes:
            0
            Watchers:
            0
