Lucene - Core
  1. Lucene - Core
  2. LUCENE-2587

Highlighter picks wrong offset for fragment boundaries

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Trivial Trivial
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
    • Environment:

      Java 6 + Lucene 3.0.2

    • Lucene Fields:
      New

      Description

      I have written a new Fragmenter since we need fragments for hitlines to be on sentence boundaries and not cross paragraphs.
      When using it with org.apache.lucene.search.highlight.Highlighter, I get hitlines that starts with ". ", "? ", "! "...

      Consider the text "A b c d e. F g h i j! K l m n o. "
      which become the tokenstream : (A) (b) (c) (d) (e) (F) (g) (h) (j) (K) (l) (m) (o)

      If the fragmenter return isNewFragment() = true on F and K and Highlighter pick the middle fragment, lets say we search on "g" the hitline becomes:
      ". F <B>g</B> h i j"

      The reason, it seems, is that the offset to the fragment boundaries found by taking the endOffset of the last token in a fragment ,
      not the startOffset of the first.

      TJ

      1. TestIMSentenceFragmenter.java
        13 kB
        Terje Eggestad
      2. IMSentenceFragmenter.java
        9 kB
        Terje Eggestad
      3. LUCENE-2587.patch
        3 kB
        Roberto Minelli

        Activity

        Hide
        Roberto Minelli added a comment -

        Suppose I have a dummy Fragmenter that returns isNewFragment() = true on F and K. I have the text "A b c d e. F g h i j! K l m n o. ". I search for "g" hence the Highlighter will pick the middle fragment. The fragment it returns is ". F <b>g</b> h i j". I was wondering if the correct fragment should be " F <b>g</b> h i j" or even "F <b>g</b> h i j" without the starting whitespace. The starting "." is a bug and matches the issue LUCENE-2587. I was wondering if, instead, the starting whitespace at the beginning of the fragment is correct or not. And, in the end, if I understood correctly the issue LUCENE-2587, but I think so. Thanks in advance to anyone who wants to answer. RM

        Show
        Roberto Minelli added a comment - Suppose I have a dummy Fragmenter that returns isNewFragment() = true on F and K. I have the text "A b c d e. F g h i j! K l m n o. ". I search for "g" hence the Highlighter will pick the middle fragment. The fragment it returns is ". F <b>g</b> h i j". I was wondering if the correct fragment should be " F <b>g</b> h i j" or even "F <b>g</b> h i j" without the starting whitespace. The starting "." is a bug and matches the issue LUCENE-2587 . I was wondering if, instead, the starting whitespace at the beginning of the fragment is correct or not. And, in the end, if I understood correctly the issue LUCENE-2587 , but I think so. Thanks in advance to anyone who wants to answer. RM
        Hide
        Terje Eggestad added a comment -

        Sorry for the delay in responding, it's been a year since I looked at this...

        I don't have a complete test case as of right now, but I'll upload my fragmenter.
        The code is a bit messy, but works by finding all fragment boundaries on start(text,tokenstream)
        The algorithm is a s follows

        • first find paragraphs (marked as "\n\n")
        • then chop up paragraphs on "." as close to max length as possible
        • if fragments(sentences) are too long, try breaking on "," or ";", else on white space

        The problem comes since the org.apache.lucene.search.highlight.Highlighter uses the end char offset of the last token in the previous fragment, +1 , as the start char offset in the TextFragment.textStartPos
        As I'm rereading the Highligther.java code now, this is "hidden" beneath the length calculations of a fragment, which is is calculated as: (end char offset of last token) - (start char offset of first token), hence the stuff in between which is not in a token ultimately gets prepended on the resulting text.

        I ultimately did a workaround by replacing leading ".", ",", "?", "!", then a trim() on the String Highlighter.getBestFragment() returns...

        Show
        Terje Eggestad added a comment - Sorry for the delay in responding, it's been a year since I looked at this... I don't have a complete test case as of right now, but I'll upload my fragmenter. The code is a bit messy, but works by finding all fragment boundaries on start(text,tokenstream) The algorithm is a s follows first find paragraphs (marked as "\n\n") then chop up paragraphs on "." as close to max length as possible if fragments(sentences) are too long, try breaking on "," or ";", else on white space The problem comes since the org.apache.lucene.search.highlight.Highlighter uses the end char offset of the last token in the previous fragment, +1 , as the start char offset in the TextFragment.textStartPos As I'm rereading the Highligther.java code now, this is "hidden" beneath the length calculations of a fragment, which is is calculated as: (end char offset of last token) - (start char offset of first token), hence the stuff in between which is not in a token ultimately gets prepended on the resulting text. I ultimately did a workaround by replacing leading ".", ",", "?", "!", then a trim() on the String Highlighter.getBestFragment() returns...
        Hide
        Terje Eggestad added a comment -

        My sentence fragmenter

        Show
        Terje Eggestad added a comment - My sentence fragmenter
        Hide
        Roberto Minelli added a comment -

        I submit a patch. I do not have time now to test it with your Fragmenter. I think the patch should solve the issue anyway. Waiting some feedbacks.

        Show
        Roberto Minelli added a comment - I submit a patch. I do not have time now to test it with your Fragmenter. I think the patch should solve the issue anyway. Waiting some feedbacks.
        Hide
        Roberto Minelli added a comment -

        This patch should solve the issue.

        Show
        Roberto Minelli added a comment - This patch should solve the issue.
        Hide
        Steve Rowe added a comment -

        Hi Terje,

        Can you upload your IMSentenceFragmenter.java file again, but this time click on the radio button next to "Grant license to ASF for inclusion in ASF works (as per the Apache License §5)"?

        Thanks,
        Steve

        Show
        Steve Rowe added a comment - Hi Terje, Can you upload your IMSentenceFragmenter.java file again, but this time click on the radio button next to "Grant license to ASF for inclusion in ASF works (as per the Apache License §5)"? Thanks, Steve
        Hide
        Terje Eggestad added a comment -

        Sentence Fragmenter

        Show
        Terje Eggestad added a comment - Sentence Fragmenter
        Hide
        Terje Eggestad added a comment -

        Tests for Sentence Fragmenter

        Show
        Terje Eggestad added a comment - Tests for Sentence Fragmenter
        Hide
        Terje Eggestad added a comment -

        Hi Steve

        Done.
        You want to include it in future releases? Do me a favor and clean it up a bit so I don't look completely incompetent
        You'd need to add an option for paragraph demarcation.
        Also added the Test class for the Fragmenter. The test text need to be replaced, it's an excerpt from a copyrighted news paper article...
        (You should have an english text anyway)

        TJ

        Show
        Terje Eggestad added a comment - Hi Steve Done. You want to include it in future releases? Do me a favor and clean it up a bit so I don't look completely incompetent You'd need to add an option for paragraph demarcation. Also added the Test class for the Fragmenter. The test text need to be replaced, it's an excerpt from a copyrighted news paper article... (You should have an english text anyway) TJ
        Hide
        Steve Rowe added a comment - - edited

        Hi Terje,

        Thanks for posting your code and a test - I'll take a look. My initial interest is seeing how your code triggers the highlighting bug. I think the ideal way to proceed is create a test fragmenter based on but simpler than your fragmenter, to trigger the bug, and then fix the bug. Roberto may have already essentially done all of this - I haven't had a chance to look at his patch yet.

        Once that's all done, I think it's appropriate to create a new issue for your sentence fragmenter, since that would be a great addition to Lucene!

        Thanks again,
        Steve

        Show
        Steve Rowe added a comment - - edited Hi Terje, Thanks for posting your code and a test - I'll take a look. My initial interest is seeing how your code triggers the highlighting bug. I think the ideal way to proceed is create a test fragmenter based on but simpler than your fragmenter, to trigger the bug, and then fix the bug. Roberto may have already essentially done all of this - I haven't had a chance to look at his patch yet. Once that's all done, I think it's appropriate to create a new issue for your sentence fragmenter, since that would be a great addition to Lucene! Thanks again, Steve
        Hide
        Roberto Minelli added a comment -

        Yes. I actually created a dummy Fragmenter on-the-fly in my test case. Did someone looked at the patch??

        Show
        Roberto Minelli added a comment - Yes. I actually created a dummy Fragmenter on-the-fly in my test case. Did someone looked at the patch??
        Hide
        Steve Rowe added a comment -

        Hi Roberto,

        I've looked at your patch, and you've done a good job of finding the problem, fixing it, and providing a test that fails before your fix and succeeds afterward.

        However, I see a problem. Your fix changes fragments' beginning offsets to exclude preceding inter-fragment non-tokenized characters, but this causes adjacent fragments to be non-contiguous (contiguity is defined as frag1.end==frag2.start). As a result, your fix prevents adjacent fragments from being merged. (See mergeContiguousFragments(), which is called from getBestTextFragments().)

        Ideally, another test should be added that checks for the problem your current fix causes (adjacency != contiguity), and then your fix should be changed to enable contiguous fragment merging.

        Show
        Steve Rowe added a comment - Hi Roberto, I've looked at your patch, and you've done a good job of finding the problem, fixing it, and providing a test that fails before your fix and succeeds afterward. However, I see a problem. Your fix changes fragments' beginning offsets to exclude preceding inter-fragment non-tokenized characters, but this causes adjacent fragments to be non-contiguous (contiguity is defined as frag1.end==frag2.start). As a result, your fix prevents adjacent fragments from being merged. (See mergeContiguousFragments() , which is called from getBestTextFragments() .) Ideally, another test should be added that checks for the problem your current fix causes (adjacency != contiguity), and then your fix should be changed to enable contiguous fragment merging.
        Hide
        Roberto Minelli added a comment -

        I was about to report that! Yesterday night I thought about that. If I have some spare time I will look back at the code and fix it. I think the only wrong thing now is the end of a fragment that should include whitespaces and punctuation.

        Show
        Roberto Minelli added a comment - I was about to report that! Yesterday night I thought about that. If I have some spare time I will look back at the code and fix it. I think the only wrong thing now is the end of a fragment that should include whitespaces and punctuation.
        Hide
        Steve Rowe added a comment -

        I think the only wrong thing now is the end of a fragment that should include whitespaces and punctuation.

        +1

        Show
        Steve Rowe added a comment - I think the only wrong thing now is the end of a fragment that should include whitespaces and punctuation. +1

          People

          • Assignee:
            Unassigned
            Reporter:
            Terje Eggestad
          • Votes:
            2 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development