Lucene - Core
  1. Lucene - Core
  2. LUCENE-5166

PostingsHighlighter fails with IndexOutOfBoundsException

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 4.5, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Given a document with a match at a startIndex < PostingsHighlighter.maxlength and an endIndex > PostingsHighlighter.maxLength, DefaultPassageFormatter will throw an IndexOutOfBoundsException when DefaultPassageFormatter.append() is invoked.

      1. LUCENE-5166.patch
        5 kB
        Robert Muir
      2. LUCENE-5166.patch
        6 kB
        Michael McCandless
      3. LUCENE-5166.patch
        4 kB
        Robert Muir
      4. LUCENE-5166.patch
        4 kB
        Robert Muir
      5. LUCENE-5166.patch
        6 kB
        Robert Muir
      6. LUCENE-5166.patch
        4 kB
        Manuel Amoabeng
      7. LUCENE-5166-2.patch
        4 kB
        Manuel Amoabeng
      8. LUCENE-5166-revisited.patch
        3 kB
        Manuel Amoabeng

        Activity

        Hide
        Manuel Amoabeng added a comment - - edited

        java.lang.IndexOutOfBoundsException: start 9999, end 10004, s.length() 10000
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:453)
        at java.lang.StringBuilder.append(StringBuilder.java:179)
        at org.apache.lucene.search.postingshighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:135)
        at org.apache.lucene.search.postingshighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:79)
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:435)
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:353)
        at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:268)

        Show
        Manuel Amoabeng added a comment - - edited java.lang.IndexOutOfBoundsException: start 9999, end 10004, s.length() 10000 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:453) at java.lang.StringBuilder.append(StringBuilder.java:179) at org.apache.lucene.search.postingshighlight.DefaultPassageFormatter.append(DefaultPassageFormatter.java:135) at org.apache.lucene.search.postingshighlight.DefaultPassageFormatter.format(DefaultPassageFormatter.java:79) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:435) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:353) at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:268)
        Hide
        Shai Erera added a comment -

        Can you show a real usecase for a document matching beyond content.length()? Your patch artificially creates an out-of-bound Passage, but I think it's better if we can see a real usecase, e.g. maybe a combination of TokenFilters may cause that?

        But if e.g. the app indexed content1 but then tries to highlight content2, I don't think that's a supported usecase...

        Show
        Shai Erera added a comment - Can you show a real usecase for a document matching beyond content.length()? Your patch artificially creates an out-of-bound Passage, but I think it's better if we can see a real usecase, e.g. maybe a combination of TokenFilters may cause that? But if e.g. the app indexed content1 but then tries to highlight content2, I don't think that's a supported usecase...
        Hide
        Manuel Amoabeng added a comment -

        Please find attached another test case. It is sort of bad luck to run into this in a real use case but it actually happened to me.

        Show
        Manuel Amoabeng added a comment - Please find attached another test case. It is sort of bad luck to run into this in a real use case but it actually happened to me.
        Hide
        Shai Erera added a comment -

        Interesting. FYI, I will not be available in the next 2 weeks, and haven't reproduced it yet. If no one assigns himself to the issue, I will when I'm back.

        Show
        Shai Erera added a comment - Interesting. FYI, I will not be available in the next 2 weeks, and haven't reproduced it yet. If no one assigns himself to the issue, I will when I'm back.
        Hide
        Robert Muir added a comment -

        I have reproduced it with Manuel's test

        Show
        Robert Muir added a comment - I have reproduced it with Manuel's test
        Hide
        Robert Muir added a comment -

        Attached is just a combined patch of Manuel's 2 patches.

        There is definitely bug(s) here.

        As far as the fix, to me the big question (I put it in a nocommit to his test case), is if formatter classes should really have to deal with these cases.

        Show
        Robert Muir added a comment - Attached is just a combined patch of Manuel's 2 patches. There is definitely bug(s) here. As far as the fix, to me the big question (I put it in a nocommit to his test case), is if formatter classes should really have to deal with these cases.
        Hide
        Michael McCandless added a comment -

        I think we shouldn't send "out of bounds" matches to the formatter? Because all it can do is bounds check and skip it?

        I think maybe we also shouldn't even send the passage if it was "truncated", even if some matches were before the truncation?

        Show
        Michael McCandless added a comment - I think we shouldn't send "out of bounds" matches to the formatter? Because all it can do is bounds check and skip it? I think maybe we also shouldn't even send the passage if it was "truncated", even if some matches were before the truncation?
        Hide
        Robert Muir added a comment -

        OK here's a patch. the cause of the bug is that we only know startOffsets are always increasing (the algorithm relies on this, and merges them across terms).

        So we cannot safely terminate when end >= limit (only start >= limit), but we don't have to confuse the formatter with the cases of terms that 'span' the limit.

        Show
        Robert Muir added a comment - OK here's a patch. the cause of the bug is that we only know startOffsets are always increasing (the algorithm relies on this, and merges them across terms). So we cannot safely terminate when end >= limit (only start >= limit), but we don't have to confuse the formatter with the cases of terms that 'span' the limit.
        Hide
        Michael McCandless added a comment -

        Hmm so this means we may pick a truncated passage to present? I suppose it's unlikely to score well ... just seems bad though.

        Wait, couldn't we fix passageQueue.offer(current) to not offer it if current.endOffset == contentLength?

        Show
        Michael McCandless added a comment - Hmm so this means we may pick a truncated passage to present? I suppose it's unlikely to score well ... just seems bad though. Wait, couldn't we fix passageQueue.offer(current) to not offer it if current.endOffset == contentLength?
        Hide
        Robert Muir added a comment -

        Well the passage may not be truncated: for example depending on the analyzer (e.g. ngrams or something), it could just be that the term "spans sentences".

        And the problem is: only startOffset is guaranteed to be increasing. so if we discard the passage just because of this, it could be unsafe since new terms that are "in bounds" have still yet to be seen...

        Show
        Robert Muir added a comment - Well the passage may not be truncated: for example depending on the analyzer (e.g. ngrams or something), it could just be that the term "spans sentences". And the problem is: only startOffset is guaranteed to be increasing. so if we discard the passage just because of this, it could be unsafe since new terms that are "in bounds" have still yet to be seen...
        Hide
        Robert Muir added a comment -

        Improved patch, thank you Mike

        Show
        Robert Muir added a comment - Improved patch, thank you Mike
        Hide
        Michael McCandless added a comment -

        I'm still confused about how we handle the "truncated passage" ... I added a [failing] test but maybe the test is bogus?

        Show
        Michael McCandless added a comment - I'm still confused about how we handle the "truncated passage" ... I added a [failing] test but maybe the test is bogus?
        Hide
        Robert Muir added a comment -

        Well the patch doesnt attempt to change anything about the breakiterator logic: so your test is "valid" but testing something different.

        Let me try to explain:
        The bug Manuel found here is "matching term spanning the content boundary". so lets call this "truncated term". This patch fixes this so formatter doesnt have to deal with it, and there is no AIOOBE or strange checks in the formatter.

        The test you write is for different behavior: it saying, if the passage itself spans the content boundary, don't present it to the formatter at all. But, this is sorta a different issue, its already handled here today by Math.min and the formatter never has to deal with it:

                // advance breakiterator
                assert BreakIterator.DONE < 0;
                current.startOffset = Math.max(bi.preceding(start+1), 0);
                current.endOffset = Math.min(bi.next(), contentLength);
        

        If we want to change the behavior for this, thats cool too, but its not really related to the bug at hand. I am not opposed to fixing it, but one problem would be someone using stuff like WholeBreakIterator, though we could "solve" it in a different way by having WholeBreakIterator take in the limit itself (I dont know if this would address your concern though).

        Show
        Robert Muir added a comment - Well the patch doesnt attempt to change anything about the breakiterator logic: so your test is "valid" but testing something different. Let me try to explain: The bug Manuel found here is "matching term spanning the content boundary". so lets call this "truncated term". This patch fixes this so formatter doesnt have to deal with it, and there is no AIOOBE or strange checks in the formatter. The test you write is for different behavior: it saying, if the passage itself spans the content boundary, don't present it to the formatter at all. But, this is sorta a different issue, its already handled here today by Math.min and the formatter never has to deal with it: // advance breakiterator assert BreakIterator.DONE < 0; current.startOffset = Math .max(bi.preceding(start+1), 0); current.endOffset = Math .min(bi.next(), contentLength); If we want to change the behavior for this, thats cool too, but its not really related to the bug at hand. I am not opposed to fixing it, but one problem would be someone using stuff like WholeBreakIterator, though we could "solve" it in a different way by having WholeBreakIterator take in the limit itself (I dont know if this would address your concern though).
        Hide
        Michael McCandless added a comment -

        OK let's not try to address that on this issue ... I'm not even sure it needs fixing. It ought to be rare-ish that a truncated passage is selected.

        Show
        Michael McCandless added a comment - OK let's not try to address that on this issue ... I'm not even sure it needs fixing. It ought to be rare-ish that a truncated passage is selected.
        Hide
        Robert Muir added a comment -

        There was a bug in my patch: I added another unit test for this!

        I think its ready.

        Show
        Robert Muir added a comment - There was a bug in my patch: I added another unit test for this! I think its ready.
        Hide
        Michael McCandless added a comment -

        +1

        Tricky!

        Show
        Michael McCandless added a comment - +1 Tricky!
        Hide
        ASF subversion and git services added a comment -

        Commit 1513207 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1513207 ]

        LUCENE-5166: PostingsHighlighter fails with IndexOutOfBoundsException

        Show
        ASF subversion and git services added a comment - Commit 1513207 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1513207 ] LUCENE-5166 : PostingsHighlighter fails with IndexOutOfBoundsException
        Hide
        ASF subversion and git services added a comment -

        Commit 1513231 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1513231 ]

        LUCENE-5166: PostingsHighlighter fails with IndexOutOfBoundsException

        Show
        ASF subversion and git services added a comment - Commit 1513231 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1513231 ] LUCENE-5166 : PostingsHighlighter fails with IndexOutOfBoundsException
        Hide
        Robert Muir added a comment -

        Thank you Manuel!

        Show
        Robert Muir added a comment - Thank you Manuel!
        Hide
        Manuel Amoabeng added a comment -

        Thank you for the quick help!

        Show
        Manuel Amoabeng added a comment - Thank you for the quick help!
        Hide
        Manuel Amoabeng added a comment -

        I just found another problem here: If we have both, matches that do and matches that don't span the content boundary the formatter is asked to highlight the spanning match.
        Please find attached additional tests and a possible fix for this.

        Show
        Manuel Amoabeng added a comment - I just found another problem here: If we have both, matches that do and matches that don't span the content boundary the formatter is asked to highlight the spanning match. Please find attached additional tests and a possible fix for this.
        Hide
        Robert Muir added a comment -

        Hi Manuel: thank you! Another bug, or a bug in my fix to the other bug

        I'll investigate deeper in a bit.

        Show
        Robert Muir added a comment - Hi Manuel: thank you! Another bug, or a bug in my fix to the other bug I'll investigate deeper in a bit.
        Hide
        Robert Muir added a comment -

        Manuel: your fix is correct, thank you.

        To explain: I had totally forgotten about this little loop on tf within the passage (i had removed this optimization in LUCENE-4909, which didnt turn out to work that great, so wasn't committed).

        We might at some point want to still just remove the optimization just based on the reason that it makes this thing more complicated, it was just intended to speed up the worst case (where someone has very common stopwords and stuff like that).

        But for now to complete the bugfix, we should commit your patch (LUCENE-5166-revisited.patch).

        Show
        Robert Muir added a comment - Manuel: your fix is correct, thank you. To explain: I had totally forgotten about this little loop on tf within the passage (i had removed this optimization in LUCENE-4909 , which didnt turn out to work that great, so wasn't committed). We might at some point want to still just remove the optimization just based on the reason that it makes this thing more complicated, it was just intended to speed up the worst case (where someone has very common stopwords and stuff like that). But for now to complete the bugfix, we should commit your patch ( LUCENE-5166 -revisited.patch).
        Hide
        ASF subversion and git services added a comment -

        Commit 1514367 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1514367 ]

        LUCENE-5166: also fix and test this case where tf > 1 within the passage for a term

        Show
        ASF subversion and git services added a comment - Commit 1514367 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1514367 ] LUCENE-5166 : also fix and test this case where tf > 1 within the passage for a term
        Hide
        ASF subversion and git services added a comment -

        Commit 1514379 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1514379 ]

        LUCENE-5166: also fix and test this case where tf > 1 within the passage for a term

        Show
        ASF subversion and git services added a comment - Commit 1514379 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1514379 ] LUCENE-5166 : also fix and test this case where tf > 1 within the passage for a term
        Hide
        Robert Muir added a comment -

        Thank you again!

        Show
        Robert Muir added a comment - Thank you again!
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close
        Hide
        ASF subversion and git services added a comment -

        Commit 1594464 from Robert Muir in branch 'dev/branches/lucene5666'
        [ https://svn.apache.org/r1594464 ]

        LUCENE-5166: clear most nocommits, move ord/rord to solr (and speed them up), nuke old purging stuff

        Show
        ASF subversion and git services added a comment - Commit 1594464 from Robert Muir in branch 'dev/branches/lucene5666' [ https://svn.apache.org/r1594464 ] LUCENE-5166 : clear most nocommits, move ord/rord to solr (and speed them up), nuke old purging stuff

          People

          • Assignee:
            Unassigned
            Reporter:
            Manuel Amoabeng
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development