SOLR-2891

InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

    Details

      Description

      I would like to handle German umlauts by replacing each accented character with its two-letter substitute (e.g. ä => ae). For this I use the char filter solr.MappingCharFilterFactory, configured with a mapping file containing entries like "ä" => "ae". I also want to use solr.DictionaryCompoundWordTokenFilterFactory to find words that are part of compound words (e.g. revision in totalrevision). Finally, I want to use Solr highlighting. However, combining the char filter and the compound word filter with highlighting raises an org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

      Here are the details:

      types:
      --------
      <fieldType name="textAnalyzedFailed" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="words.txt"/>
        </analyzer>
      </fieldType>

      schema:
      -----------
      <fields>
        <field name="id" type="string" indexed="true" stored="true" required="true" />
        <field name="title" type="textAnalyzedFailed" indexed="true" stored="true"/>
      </fields>

      document:
      --------------
      <doc>
        <field name="id">1</field>
        <field name="title">banküberfall</field>
      </doc>

      mapping.txt:
      -----------------
      "ü" => "ue"

      words.txt:
      --------------
      fall

      Searching with the following URL raises the error shown in the log that follows it:

      http://localhost:8080/solr/select/?q=banküberfall&hl=true&hl.fl=title

      Nov 4, 2011 4:29:12 PM org.apache.solr.core.SolrCore execute
      INFO: [] webapp=/solr path=/select/ params={q=bank?berfall&hl.fl=title_hl&hl=true} hits=1 status=0 QTime=13
      Nov 4, 2011 4:29:16 PM org.apache.solr.common.SolrException log
      SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
      at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
      at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
      at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
      at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
      at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
      at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      at java.lang.Thread.run(Thread.java:680)
      Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
      at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
      ... 23 more

      The analysis tool reports the following for field name=title, field value=banküberfall:
      ------------------------------------------------------------------------------------
      Index Analyzer

      org.apache.solr.analysis.MappingCharFilterFactory {mapping=mapping.txt, luceneMatchVersion=LUCENE_31}
      text           bankueberfall

      org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_31}
      position       1
      term text      bankueberfall
      startOffset    0
      endOffset      12

      org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory {dictionary=words.txt, luceneMatchVersion=LUCENE_31}
      position       1
      term text      bankueberfall    fall
      startOffset    0                9
      endOffset      12               13
      flags          0                0
      type           word             word
      payload
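
The offsets in the analysis output can be reproduced with a little arithmetic. The following is only a minimal Python sketch, not Lucene code: `map_chars` and `naive_subword_offsets` are hypothetical helpers, and the sketch assumes the compound filter derives a subword's offsets by adding the subword's index within the mapped term to the token's start offset. The stored text has 12 characters, the mapped term has 13, so the subword "fall" ends at offset 13, one past the end of the stored text.

```python
# Stored field value; \u00fc is the precomposed "ü" (a single character).
ORIGINAL = "bank\u00fcberfall"
MAPPING = {"\u00fc": "ue"}  # same rule as mapping.txt


def map_chars(text, mapping):
    """Hypothetical stand-in for the char filter: replace mapped characters."""
    return "".join(mapping.get(ch, ch) for ch in text)


def naive_subword_offsets(term, token_start, subword):
    """Assumed (unfixed) behaviour: offsets are computed from positions in the
    mapped term, ignoring that the term is longer than the original text."""
    index = term.find(subword)
    if index < 0:
        return None
    start = token_start + index
    return start, start + len(subword)


mapped = map_chars(ORIGINAL, MAPPING)

# The tokenizer reports offsets corrected back to the original text: 0..12.
token_start, token_end = 0, len(ORIGINAL)

sub_start, sub_end = naive_subword_offsets(mapped, token_start, "fall")

# The highlighter works on the stored (original) text, so an end offset
# beyond its length cannot be highlighted.
exceeds = sub_end > len(ORIGINAL)

print(len(ORIGINAL), len(mapped), (sub_start, sub_end), exceeds)
```

This matches the exception message "Token fall exceeds length of provided text sized 12".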

        Activity

        Vadim Kisselmann added a comment -

        It's an old bug. I also have big problems with OffsetExceptions when I use
        highlighting or Carrot.
        It looks like a problem with HTMLStripCharFilter.
        The patch doesn't work.

        https://issues.apache.org/jira/browse/LUCENE-2208

        Robert Muir added a comment -

        The problem is CompoundWordTokenFilter has the same bugs as LUCENE-3642. There was a note in the source code (I think noted by Uwe):

        // TODO: This ignores the original endOffset, if a CharFilter/Tokenizer/Filter removed
        // chars from the term, offsets may not match correctly (other filters producing tokens
        // may also have this problem):
        

        Edwin: thanks for providing good information. I turned this into a test and fixed it the same way as in LUCENE-3642.
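
The comment does not show the patch itself, so the following Python sketch rests on an assumption about what "the same way as LUCENE-3642" means: when the token's offset span does not match the length of its term text (because a char filter or an earlier filter changed the text), the subword keeps the offsets of the whole token instead of getting computed ones. `subword_offsets` is a hypothetical name for illustration, not a Lucene method.

```python
def subword_offsets(term, token_start, token_end, index, length):
    """Return (start, end) offsets for a subword found at `index` in `term`.

    Assumed rule: if the token's offsets no longer describe the term text
    one-to-one, do not adjust them; reuse the offsets of the whole token.
    """
    if token_end - token_start != len(term):
        return token_start, token_end
    start = token_start + index
    return start, start + length


# Case from this issue: term has 13 characters, offsets span only 12,
# so "fall" falls back to the offsets of the full token.
print(subword_offsets("bankueberfall", 0, 12, 9, 4))

# Unmodified text: term length equals the offset span, so the subword
# gets precise offsets.
print(subword_offsets("bankraub", 0, 8, 4, 4))
```

Under this rule the highlighter would mark the whole compound rather than only the subword when the text was rewritten upstream, which is less precise but never points outside the stored text.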

        Robert Muir added a comment -

        Thanks Edwin!


          People

          • Assignee: Robert Muir
          • Reporter: Edwin Steiner
          • Votes: 0
          • Watchers: 2
