• Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
    • Lucene Fields:
      New, Patch Available


      I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.


      TopDocs docs = query, 10 );
      Highlighter h = new Highlighter();
      FieldQuery fq = h.getFieldQuery( query );
      for( ScoreDoc scoreDoc : docs.scoreDocs ){
        // fieldName="content", fragCharSize=100, numFragments=3
        String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
        if( fragments != null ){
          for( String fragment : fragments )
            System.out.println( fragment );


      • fast for large docs
      • supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
      • supports PhraseQuery, phrase-unit highlighting with slops
        q="w1 w2"
        <b>w1 w2</b>
        q="w1 w2"~1
        <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
      • highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
      • easy to apply patch due to independent package (contrib/highlighter2)
      • uses Java 1.5
      • looks query boost to score fragments (currently doesn't see idf, but it should be possible)
      • pluggable FragListBuilder
      • pluggable FragmentsBuilder

      to do:

      • term positions can be unnecessary when phraseHighlight==false
      • collects performance numbers
      1. LUCENE-1522-fix-SIOOBE.patch
        3 kB
        Koji Sekiguchi
      2. LUCENE-1522-multiValued-test.patch
        5 kB
        Koji Sekiguchi
      3. LUCENE-1522.patch
        202 kB
        Koji Sekiguchi
      4. LUCENE-1522.patch
        197 kB
        Michael McCandless
      5. LUCENE-1522.patch
        152 kB
        Koji Sekiguchi
      6. LUCENE-1522.patch
        151 kB
        Koji Sekiguchi
      7. LUCENE-1522.patch
        146 kB
        Koji Sekiguchi
      8. colored-tag-sample.png
        28 kB
        Koji Sekiguchi
      9. LUCENE-1522.patch
        123 kB
        Koji Sekiguchi
      10. LUCENE-1522.patch
        121 kB
        Koji Sekiguchi

        Issue Links


          No work has yet been logged on this issue.


            • Assignee:
              Mark Miller
              Koji Sekiguchi
            • Votes:
              0 Vote for this issue
              2 Start watching this issue


              • Created: