Lucene - Core
  1. Lucene - Core
  2. LUCENE-4816

PassageFormatter in PostingsHighlighter trunk the message returned

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.3, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      NA

    • Lucene Fields:
      New, Patch Available

      Description

      when I try to highlight the word zero [0] in the file : org\apache\lucene\search\postingshighlight\package.html

      the 2 last lines weren't return. There are 4 Passages :
      2-65
      277-434
      434-735
      735-968

      but the length of the file is 984.

      in the file : PassageFormatter.format(...)

      it should return all the original content with the words highlighted.

      PATCH

      need to add this at the end of the method

      // at line : 91 add this
      if(pos<content.length())

      { sb.append(content.substring(pos)); }

      return sb.toString();

      1. BreakIteratorTest.java
        3 kB
        Sebastien Dionne
      2. LUCENE-4816.patch
        7 kB
        Robert Muir
      3. LUCENE-4816.patch
        7 kB
        Michael McCandless
      4. package.html
        0.9 kB
        Sebastien Dionne
      5. PassageFormatter.java
        3 kB
        Sebastien Dionne
      6. PassageFormatter-PATCH.java
        3 kB
        Sebastien Dionne
      7. WholeBreakIterator.java
        2 kB
        Michael McCandless

        Activity

        Hide
        Sebastien Dionne added a comment -

        here the source file modified

        Show
        Sebastien Dionne added a comment - here the source file modified
        Hide
        Michael McCandless added a comment - - edited

        Hmm the attached PassageFormatter.java doesn't have any change vs the 4.1.0 sources? Maybe you attached the wrong version?

        But, I'm not sure this change is correct, because with that change you'll be returning content outside of what the 4 Passages had specified?

        Are you trying to highlight terms in the entire content (not just snippets/extracts)?

        Show
        Michael McCandless added a comment - - edited Hmm the attached PassageFormatter.java doesn't have any change vs the 4.1.0 sources? Maybe you attached the wrong version? But, I'm not sure this change is correct, because with that change you'll be returning content outside of what the 4 Passages had specified? Are you trying to highlight terms in the entire content (not just snippets/extracts)?
        Hide
        Sebastien Dionne added a comment -

        here the file patched

        Show
        Sebastien Dionne added a comment - here the file patched
        Hide
        Sebastien Dionne added a comment -

        here the code that I used to return the text highlighted

        PostingsHighlighter highlighter = new PostingsHighlighter();
        Query query = new TermQuery(new Term("body", term));
        TopDocs topDocs = searcher.search(query, 1);
        String highlights[] = highlighter.highlight("body", query, searcher, topDocs, 50);

        I won't post the content here of the file : package.html (just go see the attachments)

        The return String, stopped at :

        Highlighter implementation that uses offsets from postings lists.

        it missed the last html closing tags.

        if I use the version patched, it return all the content.

        Show
        Sebastien Dionne added a comment - here the code that I used to return the text highlighted PostingsHighlighter highlighter = new PostingsHighlighter(); Query query = new TermQuery(new Term("body", term)); TopDocs topDocs = searcher.search(query, 1); String highlights[] = highlighter.highlight("body", query, searcher, topDocs, 50); I won't post the content here of the file : package.html (just go see the attachments) The return String, stopped at : Highlighter implementation that uses offsets from postings lists. it missed the last html closing tags. if I use the version patched, it return all the content.
        Hide
        Sebastien Dionne added a comment -

        the example that was in problem. The ending html tags aren't return.

        The return string stopped at the char (968), but the original content containts 984 chars.

        Show
        Sebastien Dionne added a comment - the example that was in problem. The ending html tags aren't return. The return string stopped at the char (968), but the original content containts 984 chars.
        Hide
        Robert Muir added a comment -

        I don't think it should be returning any html tags here. This highlighter breaks the document into sentences. each sentence is scored and the top-N matching sentences are returned.

        it doesn't know about or deal with html tags, nor does it return documents.

        the patch here would return the whole rest of the document after the highlighted portion. I dont think we should do this.

        Show
        Robert Muir added a comment - I don't think it should be returning any html tags here. This highlighter breaks the document into sentences. each sentence is scored and the top-N matching sentences are returned. it doesn't know about or deal with html tags, nor does it return documents. the patch here would return the whole rest of the document after the highlighted portion. I dont think we should do this.
        Hide
        Sebastien Dionne added a comment -

        That the unit test provided in the Javadoc of BeakerIterator. It does return everything that it reads.

        Show
        Sebastien Dionne added a comment - That the unit test provided in the Javadoc of BeakerIterator. It does return everything that it reads.
        Hide
        Sebastien Dionne added a comment -

        if I understand, the PostingsHighlighter, could return a portion of the text scan ? In my call, I ask for 50 Passages Max, he did stop after 4 passages, before the text [0] wasn't found anymore. That's fine here. but for me the point of a Highlighter, is to Highlighter all the match (until the max occurences is found , here 50). and return the text scanned with its highlights.

        I see that Highlighter like a "FIND ALL" that will highlights the all the occurences of the word.

        by reading the javadoc, it's what I though this Highlighter was suppose to do. If not, I'll just have to create a new one that will do that.

        Show
        Sebastien Dionne added a comment - if I understand, the PostingsHighlighter, could return a portion of the text scan ? In my call, I ask for 50 Passages Max, he did stop after 4 passages, before the text [0] wasn't found anymore. That's fine here. but for me the point of a Highlighter, is to Highlighter all the match (until the max occurences is found , here 50). and return the text scanned with its highlights. I see that Highlighter like a "FIND ALL" that will highlights the all the occurences of the word. by reading the javadoc, it's what I though this Highlighter was suppose to do. If not, I'll just have to create a new one that will do that.
        Hide
        Michael McCandless added a comment -

        I think a simple way to do what you want is make a "fake" BreakIterator that returns the entire content as a single chunk?

        I've done this and it works well, for small fields anyway (title, author).

        I'll attach my version ...

        Show
        Michael McCandless added a comment - I think a simple way to do what you want is make a "fake" BreakIterator that returns the entire content as a single chunk? I've done this and it works well, for small fields anyway (title, author). I'll attach my version ...
        Hide
        Michael McCandless added a comment -

        Patch.

        Show
        Michael McCandless added a comment - Patch.
        Hide
        Robert Muir added a comment -

        I see that Highlighter like a "FIND ALL" that will highlights the all the occurences of the word.
        by reading the javadoc, it's what I though this Highlighter was suppose to do. If not, I'll just have to create a new one that will do that.

        thats not the intent of any highlighter in lucene really (in their default modes of operation).

        using topn=50 and things like that to implement a "FIND ALL" is not an ideal way to try to tune highlighting if the field is very short (like a title field).

        Show
        Robert Muir added a comment - I see that Highlighter like a "FIND ALL" that will highlights the all the occurences of the word. by reading the javadoc, it's what I though this Highlighter was suppose to do. If not, I'll just have to create a new one that will do that. thats not the intent of any highlighter in lucene really (in their default modes of operation). using topn=50 and things like that to implement a "FIND ALL" is not an ideal way to try to tune highlighting if the field is very short (like a title field).
        Hide
        Robert Muir added a comment -

        I think a simple way to do what you want is make a "fake" BreakIterator that returns the entire content as a single chunk?

        I think its a good idea. maybe postingshighlighter can allow for breakiterator=null (or have a static method to get one of these crazy wholebreakiterators).

        I would prefer the former really, as then its just implementation detail. But i'm worried that 'null' could mean that this makes it the new default: since IDEs love to auto-populate parameters with null.

        Show
        Robert Muir added a comment - I think a simple way to do what you want is make a "fake" BreakIterator that returns the entire content as a single chunk? I think its a good idea. maybe postingshighlighter can allow for breakiterator=null (or have a static method to get one of these crazy wholebreakiterators). I would prefer the former really, as then its just implementation detail. But i'm worried that 'null' could mean that this makes it the new default: since IDEs love to auto-populate parameters with null.
        Hide
        Sebastien Dionne added a comment -

        Robert, I'm agree with you, but it was a test to see if it will return everything. I think that I'll have to create my own. I though it was a bug because to content was trunked. Even if I think it's still a bug At least the javadoc should say that it will stop at the last Passage found, like that the user will know.

        Show
        Sebastien Dionne added a comment - Robert, I'm agree with you, but it was a test to see if it will return everything. I think that I'll have to create my own. I though it was a bug because to content was trunked. Even if I think it's still a bug At least the javadoc should say that it will stop at the last Passage found, like that the user will know.
        Hide
        Michael McCandless added a comment -

        Patch, allowing null for BreakIterator to mean "highlight all content as a single Passage".

        Show
        Michael McCandless added a comment - Patch, allowing null for BreakIterator to mean "highlight all content as a single Passage".
        Hide
        Robert Muir added a comment -

        Patch is good: i think I will look at this breakiterator as it is total evil (it will loop forever and never return Breakiterator.END) and see if it can be improved, but its not a huge deal.

        Show
        Robert Muir added a comment - Patch is good: i think I will look at this breakiterator as it is total evil (it will loop forever and never return Breakiterator.END) and see if it can be improved, but its not a huge deal.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1456115

        LUCENE-4816: passing null to PostingsHighlighter means the entire content is treated as a single Passage

        Show
        Commit Tag Bot added a comment - [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1456115 LUCENE-4816 : passing null to PostingsHighlighter means the entire content is treated as a single Passage
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1456116

        LUCENE-4816: passing null to PostingsHighlighter means the entire content is treated as a single Passage

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1456116 LUCENE-4816 : passing null to PostingsHighlighter means the entire content is treated as a single Passage
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1456597

        LUCENE-4816: expose bm25 parameters in PassageScorer

        Show
        Commit Tag Bot added a comment - [trunk commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1456597 LUCENE-4816 : expose bm25 parameters in PassageScorer
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Robert Muir
        http://svn.apache.org/viewvc?view=revision&revision=1456599

        LUCENE-4816: expose bm25 parameters in PassageScorer

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Robert Muir http://svn.apache.org/viewvc?view=revision&revision=1456599 LUCENE-4816 : expose bm25 parameters in PassageScorer
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Sebastien Dionne
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development