Lucene - Core / LUCENE-2587

Highlighter picks wrong offset for fragment boundaries

    Details

    • Type: Bug
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
    • Environment:

      Java 6 + Lucene 3.0.2

    • Lucene Fields:
      New

      Description

      I have written a new Fragmenter since we need fragments for hitlines to be on sentence boundaries and not cross paragraphs.
      When I use it with org.apache.lucene.search.highlight.Highlighter, I get hitlines that start with ". ", "? ", "! "...

      Consider the text "A b c d e. F g h i j! K l m n o. "
      which becomes the token stream: (A) (b) (c) (d) (e) (F) (g) (h) (i) (j) (K) (l) (m) (n) (o)

      If the fragmenter returns isNewFragment() = true on F and K, and the Highlighter picks the middle fragment (say we search for "g"), the hitline becomes:
      ". F <B>g</B> h i j"

      The reason, it seems, is that the fragment boundary offset is taken from the endOffset of the last token in the preceding fragment,
      not from the startOffset of the first token in the new fragment.
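
      The difference between the two boundary choices can be sketched in plain Java. This is a simplified model of the behaviour described above, not Lucene's Highlighter code: the class FragmentOffsetDemo, the Tok type, and the tokenize and fragment helpers are all hypothetical names introduced for illustration, and the tokenizer simply treats every letter as a one-character token.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the reported behaviour; none of these names come from Lucene.
class FragmentOffsetDemo {

    /** Hypothetical stand-in for a token carrying start/end character offsets. */
    static final class Tok {
        final int start;
        final int end;
        Tok(int start, int end) { this.start = start; this.end = end; }
    }

    /** Assumption: every letter is its own token; punctuation and spaces are gaps. */
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<Tok>();
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) {
                toks.add(new Tok(i, i + 1));
            }
        }
        return toks;
    }

    /**
     * Returns the text of the fragment covering tokens first..last (inclusive).
     * useStartOffset == false: the fragment begins at the endOffset of the
     * previous token, so the gap before the first token is included.
     * useStartOffset == true: the fragment begins at the startOffset of its
     * own first token.
     */
    static String fragment(String text, List<Tok> toks, int first, int last,
                           boolean useStartOffset) {
        int begin;
        if (useStartOffset) {
            begin = toks.get(first).start;
        } else {
            begin = (first == 0) ? 0 : toks.get(first - 1).end;
        }
        return text.substring(begin, toks.get(last).end);
    }

    public static void main(String[] args) {
        String text = "A b c d e. F g h i j! K l m n o. ";
        List<Tok> toks = tokenize(text);
        // Tokens 5..9 are F, g, h, i, j (the middle fragment).
        System.out.println("[" + fragment(text, toks, 5, 9, false) + "]"); // [. F g h i j]
        System.out.println("[" + fragment(text, toks, 5, 9, true) + "]");  // [F g h i j]
    }
}
```

      With the boundary taken from the previous token's endOffset, the sentence-ending ". " is pulled into the next fragment, which matches the hitline shown above. Taking the boundary from the first token's startOffset drops it.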

      TJ

      1. TestIMSentenceFragmenter.java
        13 kB
        Terje Eggestad
      2. LUCENE-2587.patch
        3 kB
        Roberto Minelli
      3. IMSentenceFragmenter.java
        9 kB
        Terje Eggestad

        Activity

        Terje Eggestad made changes -
        Attachment TestIMSentenceFragmenter.java [ 12502374 ]
        Terje Eggestad made changes -
        Attachment IMSentenceFragmenter.java [ 12502373 ]
        Terje Eggestad made changes -
        Attachment IMSentenceFragmenter.java [ 12502119 ]
        Roberto Minelli made changes -
        Comment [ Do you think that this dummy Fragmenter (http://goo.gl/z9N6p) is ok to replicate the issue? Of course it refers just to the specific case explained in the issue description, returning isNewFragment() = true on F and K.

        I need some opinion to proceed. ]
        Roberto Minelli made changes -
        Attachment LUCENE-2587.patch [ 12502189 ]
        Terje Eggestad made changes -
        Attachment IMSentenceFragmenter.java [ 12502119 ]
        Steve Rowe made changes -
        Labels newdev
        Mark Thomas made changes -
        Workflow: Default workflow, editable Closed status [ 12562465 ] -> jira [ 12584768 ]
        Mark Thomas made changes -
        Workflow: jira [ 12517205 ] -> Default workflow, editable Closed status [ 12562465 ]
        Terje Eggestad created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Terje Eggestad
          • Votes:
            2
            Watchers:
            3
