Lucene - Core / LUCENE-1500

Highlighter throws StringIndexOutOfBoundsException

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      Found this running the example code in Solr (latest version).

    • Lucene Fields:
      New, Patch Available

      Description

      Using the canonical Solr example (ant run-example) I added this document (using exampledocs/post.sh):

      <add><doc>
      <field name="id">Test for Highlighting StringIndexOutOfBoundsExcdption</field>
      <field name="name">Some Name</field>
      <field name="manu">Acme, Inc.</field>
      <field name="features">Description of the features, mentioning various things</field>
      <field name="features">Features also is multivalued</field>
      <field name="popularity">6</field>
      <field name="inStock">true</field>
      </doc></add>

      and then the URL http://localhost:8983/solr/select/?q=features&hl=true&hl.fl=features caused the exception.

      I have a patch. I don't know if it is completely correct, but it avoids this exception.

      1. LUCENE-1500.patch
        1 kB
        Peter Wolanin
      2. Lucene-1500-NewException.patch
        8 kB
        Mark Harwood
      3. Lucene-1500-NewException.patch
        7 kB
        Mark Harwood
      4. patch.txt
        1 kB
        David Bowen

        Activity

        Michael McCandless added a comment -

        Should it not solve the issue?

        The change to highlighter (which is committed, on trunk) is simply to throw a more meaningful, checked exception (InvalidTokenOffsetsException) when the Token has out-of-bounds offsets. After upgrading, this is an exception your code needs to handle.
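        For illustration, a minimal sketch of the kind of handling client code needs after upgrading (a sketch only; the method and field names are placeholders, assuming the trunk Highlighter API):

        import java.io.IOException;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.highlight.Highlighter;
        import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
        import org.apache.lucene.search.highlight.QueryScorer;

        /** Returns the best fragment, or null if the token offsets are inconsistent. */
        static String safeHighlight(Query query, Analyzer analyzer, String field, String text)
            throws IOException {
          Highlighter highlighter = new Highlighter(new QueryScorer(query));
          try {
            return highlighter.getBestFragment(analyzer, field, text);
          } catch (InvalidTokenOffsetsException e) {
            // a token's offsets fall outside 'text'; degrade gracefully by
            // skipping highlighting for this value rather than failing the search
            return null;
          }
        }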

        But this issue doesn't fix the root causes that actually produce Tokens with incorrect offsets – that's very app specific. EG, the particular case that led to this issue was SOLR-925 (which is now fixed).

        Are you actually hitting this exception at runtime, or noting that you need to handle it at compilation time?

        Lars Olson added a comment - - edited

        Hi! At the risk of being a noob, I just have to ask...
        The patch Lucene-1500-NewException.patch - is it incorporated in the Lucene nightly builds as of 2009-04-27?

        I am now running my code against the nightly builds of Lucene20090427, and those libs give me the new error "InvalidTokenOffsetsException" at the exact same spot where the 2.4.1 libs return a StringIndexOutOfBounds exception. Should it not solve the issue?

        If not, could you maybe suggest client code that works around it?

        When using the Tika test documents with the search query "tika", the resulting 9 documents produce 5 occurrences of the exception.

        PS: I'm not running Solr, just POJO with Lucene and Tika.

        Mark Harwood added a comment -

        Committed in revision: 758460

        Mark Harwood added a comment -

        With updated Apache license header.

        I'll commit soon if no objections

        Michael McCandless added a comment -

        The InvalidTokenOffsetsException.java just needs a copyright header; otherwise patch looks good!

        Mark Harwood added a comment -

        Added support for testing whether either the Token start or end offset is > text.length.

        Added javadoc comments for the new exception.

        Mark Harwood added a comment -

        Will submit a new patch tonight.

        Michael McCandless added a comment -

        Handing this one off to you Mark!

        Mark Harwood added a comment -

        I struggle to see why endOffset<startOffset should ever be acceptable but also share your concerns about the disruption of changing the Token API to enforce this.

        So, I'll add code to the patch to check for bad startOffsets too. If we had more "points of use" for Token offsets outside of highlighting I'd be more concerned, but things being the way they are this seems like the most pragmatic option.

        Michael McCandless added a comment -

        What did you think of my suggestion to fix this problem at its source - i.e. Token should never be in a state with end<start in the first place?

        I'm wary of this, for the same reason as cited above. I think the
        start/end offsets are rather opaque to Lucene, and I would lean
        towards asserting as little as possible about them during indexing.

        (Not to mention the challenges, as you enumerated, of actually doing
        this checking with the current API).

        I think at point-of-use (highlighting) we should assert the offsets
        are valid, for highlighting.

        Mark Harwood added a comment -

        Isn't your example predicated on being given an invalid Token with end<start?

        What did you think of my suggestion to fix this problem at it's source - i.e. Token should never be in a state with end<start in the first place?

        Achieving this goal is complicated by the fact that offsets are not only set in the constructor - there are independent set methods for the start and end offsets which can be called in any order.
        One solution would be to deprecate Token.setStartOffset and Token.setEndOffset, replacing them with a Token.setExtent(int startOffset, int endOffset) that performs the appropriate checks.
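        A rough sketch of what such a combined setter might look like (purely hypothetical - setExtent is only a proposal here, not an existing Token method):

        // would live inside org.apache.lucene.analysis.Token
        public void setExtent(int startOffset, int endOffset) {
          if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid token extent: startOffset="
                + startOffset + " endOffset=" + endOffset);
          }
          this.startOffset = startOffset;
          this.endOffset = endOffset;
        }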

        Michael McCandless added a comment -

        I think given the addition of a checked exception to the Highlighter API, we should not back-port this to 2.4.1.

        Michael McCandless added a comment -

        The problem is, this hits a StringIndexOutOfBoundsException:

        String s = "abcd";
        s.substring(3, 2);
        
        java.lang.StringIndexOutOfBoundsException: String index out of range: -1
            at java.lang.String.substring(String.java:1938)
        

        I think highlighter should guard against this?

        Mark Harwood added a comment -

        My thoughts were that this exception solely traps inconsistencies with Tokens in relation to a particular provided chunk of text.

        I think internal inconsistencies within a Token (e.g. endOffset < startOffset) should ideally be handled by Token (throwing something like an IllegalArgumentException in its constructor).
        I guess an open question there is can startOffset=endOffset in a Token? Either way, String.substring simply returns an empty string so I think that's probably OK in highlighter.

        Michael McCandless added a comment -

        Patch looks great, thanks Mark!

        Do you also need to check that startOffset is within bounds? EG if both start & end offset == length of input text, what does substring do?

        If endOffset is < startOffset (seriously invalid token offsets) what happens?

        Maybe just add javadoc to the @throws explaining why this new exception would be thrown?
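        For reference, standard java.lang.String.substring behaviour at those boundaries:

        String text = "abcd";    // length() == 4
        text.substring(4, 4);    // "" - begin == end == length() is allowed
        text.substring(2, 2);    // "" - equal offsets always yield an empty string
        text.substring(3, 5);    // throws StringIndexOutOfBoundsException (end > length())
        text.substring(3, 2);    // throws StringIndexOutOfBoundsException (end < begin)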

        Mark Harwood added a comment -

        Attached a patch with new checked exception.
        This will have a knock-on effect on all Highlighter client code (Solr?) as it introduces a new checked exception that must be handled.

        Koji Sekiguchi added a comment -

        Peter, thank you.

        In the thread you suggest that this API could be added to Lucene Java?

        I've proposed CharFilter feature for Lucene in LUCENE-1466 . If you like it, please vote for it.

        Peter Wolanin added a comment -

        Koji - thanks - I was aware that not all tokenizers worked with the mapping filter, but I was apparently misinformed, since I was told that "solr.HTMLStripWhitespaceTokenizerFactory" was also suitable for use with CharFilter. Indeed, your e-mail thread linked from SOLR-822 describes exactly the problem I have:

        As you can see, if you use CharFilter, Token offsets could be incorrect because CharFilters may convert 1 char to 2 chars or the other way around.

        In the thread you suggest that this API could be added to Lucene Java?

        Koji Sekiguchi added a comment -

        for that field type. I will investigate more and post a SOLR issue.

        If you use charFilter, your Tokenizer must be a "CharStreamAware" Tokenizer. We have only two CharStreamAwareTokenizers - CharStreamAwareWhitespaceTokenizer and CharStreamAwareCJKTokenizer in Solr. Please consider using CharStreamAwareWhitespaceTokenizer in your case. To know what CharStreamAware Tokenizer is, see SOLR-822 .

        Peter Wolanin added a comment -

        Ah, it occurs to me that we first saw this bug recently - and it seems likely it was only after starting to use :

         <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        

        for that field type. I will investigate more and post a SOLR issue.

        Hoss Man added a comment -

        Peter: I tried some experiments with the analyzer specified in the schema+fieldType you referenced, and I couldn't reproduce any problem.

        I suggest you open a new Jira issue against Solr and attach some sort of reproducible test (either a junit test OR a patch against the example configs + an example doc to index and an example query that either triggers an error or highlights an incorrect substring) so people have a starting point for trying to figure out where the incorrect offsets are coming from.

        LUCENE-1500 can/should stay open, but should focus specifically on the issue of what Highlighter's behavior should be when dealing with offsets that go past the end of the string (there seems to be consensus that it should do something different from IndexOutOfBounds, it just doesn't seem to be clear what the new behavior should be)

        Peter Wolanin added a comment -

        I'm still trying to get a handle on how these pieces fit together, so sorry if I've jumped to the wrong conclusion. If the analyzer is where the offsets are calculated, then that sounds like the place to look.

        The field does use term vectors. The field uses this type from the Solr schema:

            <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
        

        The full schema is
        http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.30&pathrev=DRUPAL-6--1

        the field is

        <field name="body" type="text" indexed="true" stored="true" termVectors="true"/>
        

        in case it's relevant, the solrconfig is:
        http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/apachesolr/solrconfig.xml?revision=1.1.2.15&pathrev=DRUPAL-6--1

        Michael McCandless added a comment -

        But that class either uses TermVectors stored in the index, or re-analyzes using the analyzer passed to it, I think? Are you using TermVectors?

        Hoss Man added a comment -

        The extent of my tracing suggests it's coming when the token stream is generated, which looks to be part of the lucene highlighter: org.apache.lucene.search.highlight.TokenSources

        with my limited knowledge of solr highlighting, that really doesn't sound right.

        If the field has TermVectors, then the TokenStream used comes from there, otherwise it comes from analyzing the stored field value – either way the analyzer configured in Solr should be setting the offset values, correct?

        hence my question on the solr thread where this first came up...
        http://www.nabble.com/Error-with-highlighter-and-UTF-8-chars--to22156161.html#a22207917

        so what does the analysis screen tell you about each token produced with that input text given your configuration? in verbose mode it will show the start/end offsets for every token, so it should be fairly easy to identify where the bug is.

        Peter Wolanin added a comment -

        I am using Solr, but with a single value field. I'm using the current Solr build (includes the fix), so the bug I'm describing, which triggers the same exception as the prior Solr bug did, is still present and unrelated to SOLR-925.

        The extent of my tracing suggests it's coming when the token stream is generated, which looks to be part of the lucene highlighter: org.apache.lucene.search.highlight.TokenSources

        Michael McCandless added a comment -

        I thought the bug was in the analyzer (producing out-of-bounds token offsets), not highlighter?

        The original Solr bug (SOLR-925), which I think was the analyzer producing invalid offsets, is now fixed. Peter are you using Solr, or are you seeing this in a different context?

        Peter Wolanin added a comment -

        The bug we are seeing now happens on pretty much every document that contains multi-byte characters, but only sometimes does it go past the end of the full string and hit the exception. With the patch, the bug is still very evident; it just prevents the exception. It's a serious flaw in the highlighter - maybe it is using some non-UTF-8-aware method to calculate string lengths?

        Michael McCandless added a comment -

        This feels to me like one of those "there's something seriously wrong with the codebase" problems

        My guess is it can happen quite easily and not be so obvious until the
        "right" text goes through the analyzer. If Solr's analysis (for
        highlighting multi-valued fields) can hit this my guess is it's (wrong
        offsets coming out of analyzer) not uncommon...

        I'd still prefer #2, but I don't feel super strongly about it, so if
        there's no consensus otherwise, it's fine with me to go forward
        with #1.

        Peter Wolanin added a comment -

        Well, this patch does not (obviously) solve the real bug. Is it possible to combine #1 and #3, but possibly revert #3 later when we solve the real bug in the highlighter code?

        Mark Harwood added a comment -

        OK - choices are:

        1) Throw a RuntimeException with a more useful diagnostic message
        2) Throw a new checked Exception (introducing possible compile errors in existing client code)
        3) Check for the error condition and ignore (as done in the current patch)

        This feels to me like one of those "there's something seriously wrong with the codebase" problems rather than an invalid bit of data or user input which is external to the system so my personal preference is to lean towards 1).

        Michael McCandless added a comment -

        Mark, do you want/have time to take this one? I'd like to resolve it before 2.4.1 if we can...

        Michael McCandless added a comment -

        So to be consistent, where else in Lucene might an "IncorrectTokenOffsetsException" be a possibility - IndexWriter.addDocument(..)?

        Good question!

        I don't think we can throw it from addDocument, because we cannot
        assume/assert that the int startOffset & endOffset are in fact
        character offsets into the String/Reader we had been given?

        Technically, to Lucene the start/end offsets are somewhat opaque (I
        think?). It's not until you actually do something with them (eg, call
        the highlighter) that you are saying "these are really supposed to be
        character offsets into the specific text I just provided you".

        Mark Harwood added a comment -

        So to be consistent, where else in Lucene might an "IncorrectTokenOffsetsException" be a possibility - IndexWriter.addDocument(..)?

        Michael McCandless added a comment -

        The thing is, since it's an unchecked exception, you could be well
        into production, serving searches, etc, when suddenly unexpectedly
        this exception kills the entire search request. It's brittle.

        (vs highlighting the wrong parts but still providing search results to
        the user – graceful degradation).

        So I'm still torn on "brittle" (what we have now) vs "graceful"
        (current "workaround" patch) handling. For exceptions that can strike
        arbitrarily deep into production/searching, I usually lean towards
        being "graceful".

        Or... maybe we could introduce a new exception to call this case out
        (IncorrectTokenOffsetsException or something?), but: we make it a
        checked exception so at compile time you need to handle it?

        (I still don't like that we only catch the error intermittently... but there's
        nothing we could really do about that, so javadocs should spell that
        out).

        Hoss Man added a comment -

        Perhaps we can turn this around and ask "under what conditions is it acceptable to provide a TokenStream with tokens whose offsets exceed the length of the text provided?". Not sure I see a justifiable case for supporting that as a legitimate scenario and I would prefer the reporting of an error in this case.

        +1 to reporting an error ... but StringIndexOutOfBounds is not a helpful error. Code should certainly check the offsets before calling substring and throw a specific exception identifying which token has an invalid offset
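        A minimal sketch of the kind of guard being suggested, placed just before the substring call in getBestTextFragments (variable names are illustrative, not the actual Highlighter source):

        if (startOffset > text.length() || endOffset > text.length() || startOffset > endOffset) {
          throw new InvalidTokenOffsetsException("Token " + termText
              + " exceeds length of provided text, sized " + text.length());
        }
        String tokenText = text.substring(startOffset, endOffset);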

        Koji Sekiguchi added a comment -

        Can you post your document and schema so that others can reproduce your problem?

        Peter Wolanin added a comment -

        Yes - this patch is not a fix - but a work-around.

        The root cause is clearly somewhere in the code generating the token stream - tokens seem to be getting positions in bytes rather than characters.

        DefaultSolrHighlighter.java has this code:

        import org.apache.lucene.search.highlight.TokenSources;
        
        ...
        
                    // create TokenStream
                    try {
                      // attempt term vectors
                      if( tots == null )
                        tots = new TermOffsetsTokenStream( TokenSources.getTokenStream(searcher.getReader(), docId, fieldName) );
                      tstream = tots.getMultiValuedTokenStream( docTexts[j].length() );
                    }
                    catch (IllegalArgumentException e) {
                      // fall back to analyzer
                      tstream = new TokenOrderingFilter(schema.getAnalyzer().tokenStream(fieldName, new StringReader(docTexts[j])), 10);
                    }
        
        Mark Harwood added a comment -

        Hmmm. I'm not so sure that this "defensive coding" patch is the right thing to do here.

        One could argue that it is obscuring an error condition further upstream (as you suggest, Mike - a dodgy analyzer). Committing this will only make these errors harder to detect, e.g. we'd get forum posts saying "why doesn't my term get highlighted?"

        Perhaps we can turn this around and ask "under what conditions is it acceptable to provide a TokenStream with tokens whose offsets exceed the length of the text provided?".
        Not sure I see a justifiable case for supporting that as a legitimate scenario and I would prefer the reporting of an error in this case.

        Michael McCandless added a comment -

        I think the defensive coding (the current patch) makes sense, to avoid out-of-bounds substring call.

        But what is the root cause here? It seems likely the analyzer chain is not producing the right (original) start/end offset in the tokens? What analyzer chain are you using?

        Peter Wolanin added a comment -

        Actually, the initial patch does not avoid the exception I'm seeing, since the start of the token is OK but the end is beyond the string's end. Here is a slightly enhanced version that checks both the start and end of the token.
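        For comparison, a sketch of that "check and skip" style of workaround (illustrative only, not the exact patch):

        int start = token.startOffset();
        int end = token.endOffset();
        // skip tokens whose offsets do not fit the provided text, instead of
        // letting String.substring throw StringIndexOutOfBoundsException
        if (start < 0 || start > end || end > text.length()) {
          continue;   // ignore this token for highlighting purposes
        }
        String tokenText = text.substring(start, end);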

        Peter Wolanin added a comment - - edited

        Actually, looking at the Lucene source and the trace:

        java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
        	at java.lang.String.substring(String.java:1765)
        	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
        	at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:313)
        	at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:84)
        	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
               ...
        

        I see now that getBestTextFragments() takes in a token stream - and each token in this stream already has start/end positions set. So, this patch would mitigate the exception, but it looks like the real bug is in Solr, or perhaps elsewhere in Lucene where the token stream is constructed.

        Peter Wolanin added a comment - - edited

        I have run into this issue over the last couple days. Also using Solr, but the error is triggered by content that has multi-byte characters (such as German).

        It seems that somewhere Lucene is counting bytes instead of characters, so each substring the highlighter tries to select is offset further forward in the string being matched.
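        A quick illustration of how byte-based and character-based offsets diverge for non-ASCII text (assuming UTF-8 encoded input; the word is just an example taken from the snippet below):

        import java.nio.charset.Charset;

        String s = "Gästezulauf";                                  // 11 characters
        int chars = s.length();                                    // 11
        int bytes = s.getBytes(Charset.forName("UTF-8")).length;   // 12 ("ä" is 2 bytes in UTF-8)
        // every multi-byte character preceding a token pushes a byte-based
        // offset one or more positions past the correct character offset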

        here's an example trying to highlight the string 'Drupaltalk' with "strong" tags

        <p class="search-snippet">
         Community ist - und dieses Portal Dr<strong>upaltalk.d</strong>e samt seinem schon eifrigen Benutzer- und Gästezulauf ( ... 
         nter "Dru<strong>paltalk001</strong>" könnt Ihr die erste Konferenz noch mal nachhören und erfahren, wie Selbstorganisation in der Drupal Szene funktioniert.  
         "Dru<strong>paltalk002</strong>" ist dann der Talk vom Dienstag zum Thema "Drupal Al</p>
        

        So the attached patch would probably avoid the exception (and is a good idea) but would not fix the bug I'm seeing.

        David Bowen added a comment -

        That makes sense, thank you Koji.

        I think it may be worth incorporating this patch anyway, since it is a small change which makes the method more robust. It just eliminates the possibility of running off the end of the text string.

        Koji Sekiguchi added a comment -

        I think the bug is in Solr, not Lucene. Please see SOLR-925; a patch is there.


          People

          • Assignee: Mark Harwood
          • Reporter: David Bowen
          • Votes: 0
          • Watchers: 2
