  1. Lucene - Core
  2. LUCENE-2587

Highlighter picks wrong offset for fragment boundaries

Details

    • Type: Bug
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Environment: Java 6 + Lucene 3.0.2
    • Lucene Fields: New

    Description

      I have written a new Fragmenter because we need the fragments used for hitlines to fall on sentence boundaries and not cross paragraphs.
      When using it with org.apache.lucene.search.highlight.Highlighter, I get hitlines that start with ". ", "? ", "! "...

      Consider the text "A b c d e. F g h i j! K l m n o. "
      which becomes the token stream: (A) (b) (c) (d) (e) (F) (g) (h) (i) (j) (K) (l) (m) (n) (o)
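
To make the example concrete, here is a minimal sketch of the token offsets involved. It does not use Lucene: `tokenize` is a hypothetical stand-in for an analyzer, and `tokensAfterSentenceEnd` is only one assumed way a sentence-aware fragmenter could decide where a new fragment starts (the attached IMSentenceFragmenter may work differently). The point is that the punctuation lives in the gaps between tokens, not inside any token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
final class SentenceBoundarySketch {

    /** A token with character offsets into the original text (end is exclusive). */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on anything that is not a letter or digit, recording offsets. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                int s = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                out.add(new Tok(text.substring(s, i), s, i));
            } else {
                i++;
            }
        }
        return out;
    }

    /** Terms whose preceding gap (text between two tokens) contains '.', '?' or '!'. */
    static List<String> tokensAfterSentenceEnd(String text) {
        List<String> starts = new ArrayList<String>();
        int prevEnd = -1;
        for (Tok t : tokenize(text)) {
            if (prevEnd >= 0) {
                String gap = text.substring(prevEnd, t.start);
                if (gap.indexOf('.') >= 0 || gap.indexOf('?') >= 0 || gap.indexOf('!') >= 0) {
                    starts.add(t.term);
                }
            }
            prevEnd = t.end;
        }
        return starts;
    }

    public static void main(String[] args) {
        // prints [F, K]
        System.out.println(tokensAfterSentenceEnd("A b c d e. F g h i j! K l m n o. "));
    }
}
```

In this sketch the token (e) ends at offset 9 and (F) starts at offset 11, so the two characters ". " between them belong to neither token.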

      If the fragmenter returns isNewFragment() = true on F and K, and the Highlighter picks the middle fragment (say we search for "g"), the hitline becomes:
      ". F <B>g</B> h i j"

      The reason, it seems, is that the offset of a fragment boundary is found by taking the endOffset of the last token in the preceding fragment,
      not the startOffset of the first token in the new fragment.
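
The difference between the two boundary choices can be sketched in plain Java. This models only the behavior described in this report, not the actual Highlighter source; the class and method names are hypothetical, and the offsets are the single-character token positions of the example text.

```java
// Hypothetical model of the two boundary rules; not Lucene code.
final class FragmentBoundarySketch {

    static final String TEXT = "A b c d e. F g h i j! K l m n o. ";

    // Start offsets of the 15 one-character tokens A..o; each end offset is start + 1.
    static final int[] STARTS = {0, 2, 4, 6, 8, 11, 13, 15, 17, 19, 22, 24, 26, 28, 30};

    static int endOffset(int tokenIndex) {
        return STARTS[tokenIndex] + 1;
    }

    /**
     * Returns the text of the fragment spanning tokens firstTok..lastTok (inclusive).
     * If useFirstStart is false, the fragment begins at the endOffset of the token
     * before firstTok (the behavior described in the report); if true, it begins at
     * the startOffset of firstTok (the behavior the reporter expects).
     */
    static String fragmentText(int firstTok, int lastTok, boolean useFirstStart) {
        int begin;
        if (useFirstStart) {
            begin = STARTS[firstTok];
        } else {
            begin = (firstTok == 0) ? 0 : endOffset(firstTok - 1);
        }
        return TEXT.substring(begin, endOffset(lastTok));
    }

    public static void main(String[] args) {
        // The middle fragment covers tokens F..j, i.e. indices 5..9.
        System.out.println("[" + fragmentText(5, 9, false) + "]"); // prints [. F g h i j]
        System.out.println("[" + fragmentText(5, 9, true) + "]");  // prints [F g h i j]
    }
}
```

With the first rule the fragment inherits the ". " that closes the previous sentence, which matches the hitline shown above. In both variants the fragment ends at the endOffset of (j), so the trailing "!" is not included either.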

      TJ

      Attachments

        1. IMSentenceFragmenter.java
          9 kB
          Terje Eggestad
        2. LUCENE-2587.patch
          3 kB
          Roberto Minelli
        3. TestIMSentenceFragmenter.java
          13 kB
          Terje Eggestad

            People

              Assignee: Unassigned
              Reporter: Terje Eggestad (terje_eggestad)
              Votes: 3
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m