Solr / SOLR-5426

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 0 exceeds length of provided text sized 840

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.4, 4.5.1
    • Fix Version/s: 4.9, 6.0
    • Component/s: highlighter
    • Labels: None

      Description

      The highlighter does not work correctly on the test data.
      I added index and config files (see attached highlighter.zip) for reproducing this issue.
      Everything works fine if I search without highlighting:

      http://localhost:8983/solr/global/select?q=aa&wt=json&indent=true

      But if I search with highlighting:

      http://localhost:8983/solr/global/select?q=aa&wt=json&indent=true&hl=true&hl.fl=*_stx&hl.simple.pre=<em>&hl.simple.post=<%2Fem>

      I get the error:

      ERROR - 2013-11-07 10:17:15.797; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 0 exceeds length of provided text sized 840
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:542)
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:414)
      at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:139)
      at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
      at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
      at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
      at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
      at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
      at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
      at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
      at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
      at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
      at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
      at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
      at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
      at org.eclipse.jetty.server.Server.handle(Server.java:368)
      at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
      at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
      at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
      at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
      at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
      at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
      at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
      at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
      at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
      at java.lang.Thread.run(Unknown Source)
      Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token 0 exceeds length of provided text sized 840
      at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:225)
      at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:527)
      ... 33 more

      1. highlighter.zip
        126 kB
        Nikolay
      2. SOLR-5426.patch
        33 kB
        Uwe Schindler
      3. SOLR-5426.patch
        32 kB
        Hoss Man
      4. SOLR-5426.patch
        33 kB
        Arun Kumar
      5. SOLR-5426.patch
        32 kB
        Ahmet Arslan
      6. SOLR-5426.patch
        48 kB
        Ahmet Arslan
      7. SOLR-5426.patch
        98 kB
        Arun Kumar

          Activity

          Nikolay added a comment -

          Data for reproducing the bug: the folder "global" contains the Lucene index, and there are also two config files (schema.xml, solrconfig.xml).

          Arun Kumar added a comment -

          I investigated this issue and found that this is the intended behavior of the highlighter. The default number of characters the highlighter analyzes is hard-coded to a maximum of 51200:

          DEFAULT_MAX_CHARS_TO_ANALYZE = 50*1024;

          But the document you indexed has more characters than that in one of its field values, which causes this issue. There is a way to increase this limit by sending an additional query parameter, e.g. hl.maxAnalyzedChars=52300.

          So if I fire the query below, the error is not seen:
          http://localhost:8983/solr/global/select?q=aa&indent=true&hl=true&hl.fl=*_stx&hl.simple.pre=%3Cem%3E&hl.simple.post=%3C/em%3E&hl.maxAnalyzedChars=52300

          Hope this helps.
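          For readers skimming: the cap Arun describes can be sketched in a few lines. This is a hypothetical illustration (the class and method names are invented, not the actual Lucene code); only the constant value 50*1024 comes from the comment above.

          ```java
          // Hypothetical sketch: how a max-chars cap truncates the text the
          // highlighter hands to the analyzer. Not the Lucene implementation.
          public class MaxCharsSketch {
              static final int DEFAULT_MAX_CHARS_TO_ANALYZE = 50 * 1024; // 51200

              // Returns the portion of a field value that would be analyzed.
              static String textToAnalyze(String fieldValue, int maxAnalyzedChars) {
                  return fieldValue.length() <= maxAnalyzedChars
                          ? fieldValue
                          : fieldValue.substring(0, maxAnalyzedChars);
              }

              public static void main(String[] args) {
                  String big = "x".repeat(60_000);
                  // With the default cap, only the first 51200 chars are analyzed.
                  System.out.println(textToAnalyze(big, DEFAULT_MAX_CHARS_TO_ANALYZE).length()); // 51200
                  // Raising the cap (as hl.maxAnalyzedChars does) covers the whole value.
                  System.out.println(textToAnalyze(big, 60_000).length()); // 60000
              }
          }
          ```

          Raising hl.maxAnalyzedChars per request, as in Arun's URL above, is the Solr-level equivalent of passing a larger cap here.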

          Arun Kumar added a comment -

          Disregard my previous comment. Even if a string exceeds the max offset limit, it shouldn't blow up with an exception. On further investigation I found that when a string's token count exceeds the max offset limit, an unconsumed token from that larger string carries over to the next string in the loop. Attached a patch file with the fix; it resolves the issue.

          Hoss Man added a comment -

          Arun: Since you seem to have a grasp on the problem here, would it be possible for you to help write a unit test to recreate it?

          Hoss Man added a comment -

          Also: when submitting patches, it's really helpful if you generate the patch against the entire code base...

          https://wiki.apache.org/solr/HowToContribute#Generating_a_patch

          Arun Kumar added a comment -

          Patch generated at the project root level.

          Arun Kumar added a comment -

          Hi Hoss,

          Thanks for your review. I have updated the patch, which is now generated against the entire code base. I tried to create a unit test to reproduce the issue but couldn't do so successfully, as it is only reproducible with the combination of CachingTokenFilter and OffsetLimitTokenFilter.

          Thanks,
          Arun

          Steve Rowe added a comment -

          Arun, what do you mean when you say the following?

          I tried to create a unit test to recreate it but couldn't do that successfully as this is reproducible in combination of CachingTokenFilter along with OffsetLimitTokenFilter.

          I don't understand why these token filters need to be involved. I looked at the schema.xml in the .zip attachment, and the *_stx fields' type text_stx doesn't use those filters.

          The patch you attached can't be committed without a unit test that fails without the patch and succeeds with it.

          Arun Kumar added a comment -

          Steve, thanks for reviewing my changes. All these token filters are used by the highlighter component. I have updated the patch with a unit test that reproduces the issue; the code change in the patch fixes it.

          Ahmet Arslan added a comment -

          Brought the failing test case to trunk. Simplified the schema and added 5 test methods to isolate the problem. The problematic field type is given below. The exception occurs only for a stored and multiValued field; the test case demonstrates this.

          Another interesting thing is that the test passes

          • when WordDelimiterFilterFactory is removed from the index analyzer
          • when ReversedWildcardFilterFactory is removed from the index analyzer
            separately.
           <fieldType name="text_stx" class="solr.TextField" positionIncrementGap="100">
                <analyzer type="index">
                  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
                     maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
                </analyzer>
                <analyzer type="query">
                  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
                  <filter class="solr.LowerCaseFilterFactory"/>
                </analyzer>
              </fieldType>
          
          Ahmet Arslan added a comment -

          Does anyone know how we can add ReversedWildcardFilterFactory to TestRandomChains?

          Ahmet Arslan added a comment -

          I said/used stored by mistake; I was trying to figure out its relation to the indexed property. This patch corrects that. Note that indexed=false fields can be highlighted if a tokenizer is defined for them.

          The only failing method is testIndexedMultiValued; the other three combinations pass.

           <field name="indexed_multiValued"       type="text_stx" indexed="true" stored="true"  multiValued="true"/>
          
          Ahmet Arslan added a comment -

          Hi Arun Kumar, I couldn't see your fix in your patch. Did you accidentally forget to add it? The patch you attached does not include OffsetLimitTokenFilter.java.patch. Can you re-attach it?

          Arun Kumar added a comment -

          In my last patch upload I accidentally left out the main change; this patch includes it.

          Ahmet Arslan added a comment -

          Thanks, Arun! So this is the fix:

          -    if (offsetCount < offsetLimit && input.incrementToken()) {
          +    if (input.incrementToken() && offsetCount < offsetLimit) {
          

          With this, the test from SOLR-3193 passes too. Can you explain what the magic is here?
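          One way to see why the reordering changes behavior is short-circuit evaluation: with the limit checked first, reaching the limit means input.incrementToken() is never called again, so a token can be left unconsumed in the wrapped stream and resurface when a cached or reused stream serves the next value. A minimal, self-contained illustration (the CountingSource and drain names are hypothetical stand-ins, not the actual Lucene classes):

          ```java
          import java.util.Iterator;
          import java.util.List;

          // Hypothetical sketch of the condition-order issue in an
          // offset-limiting filter; tiny stand-ins for TokenStream machinery.
          public class OffsetLimitSketch {
              // Counts how many tokens the wrapped source is actually asked for.
              static class CountingSource {
                  final Iterator<String> it;
                  int consumed = 0;
                  CountingSource(List<String> tokens) { this.it = tokens.iterator(); }
                  boolean incrementToken() {
                      if (!it.hasNext()) return false;
                      it.next();
                      consumed++;
                      return true;
                  }
              }

              // limitFirst=true mirrors the pre-patch condition:
              //   if (offsetCount < offsetLimit && input.incrementToken()) { ... }
              static int drain(CountingSource input, int offsetLimit, boolean limitFirst) {
                  int offsetCount = 0;
                  while (limitFirst
                          ? (offsetCount < offsetLimit && input.incrementToken())
                          : (input.incrementToken() && offsetCount < offsetLimit)) {
                      offsetCount++; // stand-in for accumulating token offsets
                  }
                  return input.consumed;
              }

              public static void main(String[] args) {
                  List<String> tokens = List.of("a", "b", "c", "d");
                  // Limit checked first: the source is never touched again after the
                  // limit is hit, leaving a token pending for whoever reuses the stream.
                  System.out.println(drain(new CountingSource(tokens), 2, true));  // 2
                  // Consume first: one more token is pulled before the limit check,
                  // so nothing stale is left behind.
                  System.out.println(drain(new CountingSource(tokens), 2, false)); // 3
              }
          }
          ```

          As Uwe points out later in the thread, this reordering papers over the symptom; the underlying bug was stale state surviving reuse.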

          Hoss Man added a comment -

          I've updated the patch to cleanup the test a bit...

          • renamed the new schema field & added a comment about its purpose
          • refactored the test to eliminate some duplication & moved the test methods to the top
          • added firm assertions to the test to ensure highlighting is actually happening
            • this actually uncovered a bug in the test: it wasn't doing anything useful on the non-indexed fields because they didn't match any docs

          ...the change to OffsetLimitTokenFilter definitely fixes the problem, but I'm honestly not sure if that's the right fix. It's not clear to me why consuming the token before checking the limit is the "correct" behavior (it seems counterintuitive to me), and it makes me wonder if this is actually masking some other "real" bug in ReversedWildcardFilter.

          Uwe Schindler added a comment -

          Hi,
          the issue is in ReversedWildcardFilter: the TokenFilter does not correctly implement reset(). If the TokenStream is reused and was not completely consumed before, the state is still active. In that case it restores the "save" state and so injects a buggy token from the previous usage as the first token of the new one.

          The fix is to make ReversedWildcardFilter correctly implement reset() and null out all state. The pseudo-fix in the highlighter's token filter just hides the bug.

          I will provide a patch in a minute!

          Uwe Schindler added a comment -

          This is the correct fix:

          • Added the missing reset() in ReversedWildcardFilter
          • Made fields final where appropriate, so it's obvious which is state and which is config
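          Uwe's diagnosis can be illustrated with a tiny self-contained sketch (the ResetSketch and SavingFilter classes are hypothetical, not the actual Solr filter): a filter that buffers a saved token must clear it in reset(), or a stream abandoned half-consumed leaks that token into its next reuse.

          ```java
          import java.util.ArrayDeque;
          import java.util.List;

          // Hypothetical sketch of the ReversedWildcardFilter-style bug: a filter
          // that buffers a "save" token must null it in reset(), or a stream
          // abandoned half-consumed leaks the buffered token into the next reuse.
          public class ResetSketch {
              static class SavingFilter {
                  private final boolean clearStateOnReset;
                  private ArrayDeque<String> input = new ArrayDeque<>();
                  private String save; // token buffered for emission on a later call

                  SavingFilter(boolean clearStateOnReset) {
                      this.clearStateOnReset = clearStateOnReset;
                  }

                  // Reuse the same filter instance for a new document (Lucene-style).
                  void reset(List<String> tokens) {
                      input = new ArrayDeque<>(tokens);
                      if (clearStateOnReset) save = null; // the fix: null out all state
                  }

                  // Emits each token, then its reversed form on the following call.
                  String next() {
                      if (save != null) { String s = save; save = null; return s; }
                      String tok = input.poll();
                      if (tok == null) return null;
                      save = new StringBuilder(tok).reverse().toString();
                      return tok;
                  }
              }

              static String firstTokenAfterReuse(boolean fixed) {
                  SavingFilter f = new SavingFilter(fixed);
                  f.reset(List.of("abc"));
                  f.next();                // consume "abc"; "cba" is now buffered
                  // ...stream abandoned before being fully consumed...
                  f.reset(List.of("xyz")); // reuse for the next document
                  return f.next();
              }

              public static void main(String[] args) {
                  System.out.println(firstTokenAfterReuse(false)); // cba  (stale token leaks)
                  System.out.println(firstTokenAfterReuse(true));  // xyz  (correct after fix)
              }
          }
          ```

          This also explains Arun's earlier observation: the leaked token carried offsets from the previous (larger) string, which is what tripped the highlighter's offset check.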
          Hoss Man added a comment -

          Spoke to Uwe on IRC: he's AFK but asked me to go ahead and commit & backport on his behalf. Will do as soon as the full test & precommit runs finish.

          ASF subversion and git services added a comment -

          Commit 1602525 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1602525 ]

          SOLR-5426: Fixed a bug in ReverseWildCardFilter that could cause InvalidTokenOffsetsException when highlighting

          Ahmet Arslan added a comment -

          Thanks, Uwe and Hoss, for bringing closure.

          ASF subversion and git services added a comment -

          Commit 1602527 from hossman@apache.org in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1602527 ]

          SOLR-5426: Fixed a bug in ReverseWildCardFilter that could cause InvalidTokenOffsetsException when highlighting (merge r1602525)

          Hoss Man added a comment -

          Thanks everybody!


            People

            • Assignee: Hoss Man
            • Reporter: Nikolay
            • Votes: 2
            • Watchers: 6
