  1. Lucene - Core
  2. LUCENE-2587

Highlighter picks wrong offset for fragment boundaries

    Details

    • Type: Bug
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
    • Environment:

      Java 6 + Lucene 3.0.2

    • Lucene Fields:
      New

      Description

      I have written a new Fragmenter, since we need the fragments used for hitlines to fall on sentence boundaries and not cross paragraphs.
      When using it with org.apache.lucene.search.highlight.Highlighter, I get hitlines that start with ". ", "? ", "! "...

      Consider the text "A b c d e. F g h i j! K l m n o. "
      which becomes the token stream: (A) (b) (c) (d) (e) (F) (g) (h) (i) (j) (K) (l) (m) (n) (o)

      If the fragmenter returns isNewFragment() = true on F and K, and the Highlighter picks the middle fragment (say we search for "g"), the hitline becomes:
      ". F <B>g</B> h i j"

      The reason, it seems, is that the offset of a fragment boundary is found by taking the endOffset of the last token in the previous fragment,
      not the startOffset of the first token in the new fragment.

      TJ
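
      The boundary behaviour described above can be illustrated with a small standalone sketch. It does not use Lucene or reproduce the Highlighter's code; the Tok class, the tokenize helper and the two slicing methods are hypothetical names invented for this illustration. It only shows how the choice of offset changes the text of the middle fragment for the example sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration only: these types are NOT Lucene classes.
final class FragmentBoundaryDemo {

    /** Hypothetical stand-in for a token carrying character offsets. */
    static final class Tok {
        final String term;
        final int start; // offset of the first character
        final int end;   // offset one past the last character

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits the text into runs of letters and records their offsets. */
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetter(text.charAt(i))) {
                int s = i;
                while (i < text.length() && Character.isLetter(text.charAt(i))) {
                    i++;
                }
                toks.add(new Tok(text.substring(s, i), s, i));
            } else {
                i++; // punctuation and whitespace produce no token
            }
        }
        return toks;
    }

    /** Fragment text when it begins at the end offset of the preceding token. */
    static String fromPreviousEnd(String text, List<Tok> toks, int first, int last) {
        int begin = first == 0 ? 0 : toks.get(first - 1).end;
        return text.substring(begin, toks.get(last).end);
    }

    /** Fragment text when it begins at the start offset of its own first token. */
    static String fromFirstStart(String text, List<Tok> toks, int first, int last) {
        return text.substring(toks.get(first).start, toks.get(last).end);
    }

    public static void main(String[] args) {
        String text = "A b c d e. F g h i j! K l m n o. ";
        List<Tok> toks = tokenize(text);
        // The middle fragment covers tokens F..j, i.e. indexes 5..9.
        System.out.println("[" + fromPreviousEnd(text, toks, 5, 9) + "]"); // [. F g h i j]
        System.out.println("[" + fromFirstStart(text, toks, 5, 9) + "]");  // [F g h i j]
    }
}
```

      In this sketch, slicing from the previous token's end offset pulls the trailing ". " of the first sentence into the middle fragment, which matches the hitline shown in the report; slicing from the first token's start offset does not.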

        Attachments

        1. LUCENE-2587.patch
          3 kB
          Roberto Minelli
        2. IMSentenceFragmenter.java
          9 kB
          Terje Eggestad
        3. TestIMSentenceFragmenter.java
          13 kB
          Terje Eggestad

            People

            • Assignee:
              Unassigned
            • Reporter:
              Terje Eggestad (terje_eggestad)
            • Votes:
              2
            • Watchers:
              3