Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None

      Description

      Mark Harwood's highlighter package is a great contribution to Lucene, and I've used it a lot. However, with large documents (fields), highlighting can be quite time-consuming if you increase the number of bytes to analyze with setMaxDocBytesToAnalyze(int). The default value of 50k is often too low for indexed PDFs and similar documents, which results in empty highlight strings.

      This is an alternative approach that uses only term position vectors to build fragment info objects. A StringReader can then read the relevant fragments and skip() between them, which is a lot faster. This method also considers the entire field when finding the best fragments, so you're always guaranteed to get a highlight snippet.
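
      The idea can be illustrated with a minimal, self-contained sketch. It is not the attached FulltextHighlighter: the class FragmentSketch, its Fragment type, and the method names are hypothetical, and the term offsets are passed in as a plain array rather than read from a TermPositionVector. The sketch pads each hit with some context, merges overlapping windows into fragments, and then reads only those ranges from a StringReader, calling skip() over the text in between.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the attached FulltextHighlighter. */
class FragmentSketch {

    /** A character range [start, end) in the stored field text. */
    static final class Fragment {
        final int start;
        final int end;

        Fragment(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds fragments from term offsets. Each entry of offsets is
     * {startOffset, endOffset}, and the entries are assumed to be sorted by
     * startOffset. Every hit is padded by `context` characters on both
     * sides, and overlapping windows are merged into one fragment.
     */
    static List<Fragment> buildFragments(int[][] offsets, int context, int textLength) {
        List<Fragment> fragments = new ArrayList<>();
        int curStart = -1;
        int curEnd = -1;
        for (int[] o : offsets) {
            int start = Math.max(0, o[0] - context);
            int end = Math.min(textLength, o[1] + context);
            if (curStart >= 0 && start <= curEnd) {
                curEnd = Math.max(curEnd, end);   // overlaps the open fragment
            } else {
                if (curStart >= 0) {
                    fragments.add(new Fragment(curStart, curEnd));
                }
                curStart = start;
                curEnd = end;
            }
        }
        if (curStart >= 0) {
            fragments.add(new Fragment(curStart, curEnd));
        }
        return fragments;
    }

    /** Reads only the fragment ranges, skipping the text between them. */
    static List<String> readFragments(String text, List<Fragment> fragments) {
        List<String> result = new ArrayList<>();
        try (StringReader reader = new StringReader(text)) {
            int pos = 0;
            for (Fragment f : fragments) {
                long toSkip = f.start - pos;
                while (toSkip > 0) {
                    long skipped = reader.skip(toSkip);
                    if (skipped <= 0) {
                        break;                    // end of text reached
                    }
                    toSkip -= skipped;
                }
                char[] buf = new char[f.end - f.start];
                int read = 0;
                while (read < buf.length) {
                    int n = reader.read(buf, read, buf.length - read);
                    if (n < 0) {
                        break;                    // end of text reached
                    }
                    read += n;
                }
                result.add(new String(buf, 0, read));
                pos = f.start + read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

      In this sketch the cost of reading depends on the number and size of the fragments rather than on a byte limit, because no token stream is re-analyzed. Scoring the fragments to pick the best ones is left out.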

      Because this method only works with fields that have term positions stored, you can check whether it applies to a particular field using the following code (taken from TokenSources.java):

      TermFreqVector tfv = reader.getTermFreqVector(docId, field);
      // instanceof is false for null, so no separate null check is needed
      if (tfv instanceof TermPositionVector) {
          // term positions are stored: use FulltextHighlighter
      } else {
          // no term positions: use the standard Highlighter
      }

      Someone else might find this useful, so I'm posting the code here.

      1. FulltextHighlighter.java
        13 kB
        Ronnie Kolehmainen
      2. FulltextHighlighterTest.java
        16 kB
        Ronnie Kolehmainen
      3. svn-diff.patch
        31 kB
        Ronnie Kolehmainen
      4. TokenSources.java
        17 kB
        Ronnie Kolehmainen
      5. TokenSources.java.diff
        12 kB
        Ronnie Kolehmainen
      6. svn-diff.patch
        30 kB
        Ronnie Kolehmainen
      7. FulltextHighlighterTest.java
        15 kB
        Ronnie Kolehmainen
      8. FulltextHighlighter.java
        13 kB
        Ronnie Kolehmainen

        Activity

        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12564122 ] jira [ 12585576 ]
        Mark Thomas made changes -
        Workflow jira [ 12377297 ] Default workflow, editable Closed status [ 12564122 ]
        Uwe Schindler made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Fix Version/s 2.9 [ 12312682 ]
        Resolution Fixed [ 1 ]
        Robert Muir made changes -
        Component/s contrib/highlighter [ 12312096 ]
        Component/s Other [ 12310233 ]
        Ronnie Kolehmainen made changes -
        Attachment svn-diff.patch [ 12339558 ]
        Attachment FulltextHighlighterTest.java [ 12339560 ]
        Attachment FulltextHighlighter.java [ 12339559 ]
        Ronnie Kolehmainen made changes -
        Attachment TokenSources.java.diff [ 12338087 ]
        Attachment TokenSources.java [ 12338086 ]
        Ronnie Kolehmainen made changes -
        Attachment svn-diff.patch [ 12337977 ]
        Ronnie Kolehmainen made changes -
        Attachment FulltextHighlighterTest.java [ 12337976 ]
        Ronnie Kolehmainen made changes -
        Field Original Value New Value
        Attachment FulltextHighlighter.java [ 12337975 ]
        Ronnie Kolehmainen created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Ronnie Kolehmainen
          • Votes:
            0
            Watchers:
            1
