Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I wrote this highlighter for my project to support bi-gram token streams. General token streams (e.g. from WhitespaceTokenizer) are also supported; see the test code in the patch. The idea was inherited from a previous project with my colleague and from LUCENE-644. This approach requires the highlighted fields to be indexed with TermVector.WITH_POSITIONS_OFFSETS, but it is fast and supports N-grams. It depends on LUCENE-1448 to get refined term offsets.
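      To illustrate why stored offsets make bi-gram highlighting cheap, here is a minimal sketch in plain Java. It is not code from the patch and does not use the Lucene API: the class OffsetHighlightSketch and its methods are hypothetical, and the offsets are computed on the fly rather than read from a term vector. It also matches each bi-gram independently, so it is not phrase matching. The point is only that overlapping bi-gram offsets have to be merged before tags are inserted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not part of the patch and not a Lucene class.
class OffsetHighlightSketch {

    // Returns the [start, end) character offsets of every overlapping
    // bi-gram in the text that equals one of the query bi-grams.
    static List<int[]> matchBigrams(String text, List<String> queryBigrams) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            if (queryBigrams.contains(text.substring(i, i + 2))) {
                hits.add(new int[] { i, i + 2 });
            }
        }
        return hits;
    }

    // Merges overlapping or touching ranges. Input must be sorted by start.
    static List<int[]> merge(List<int[]> ranges) {
        List<int[]> merged = new ArrayList<>();
        for (int[] r : ranges) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1]) {
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new int[] { r[0], r[1] });
            }
        }
        return merged;
    }

    // Wraps each range in <b>...</b>. Ranges must be sorted and non-overlapping.
    static String highlight(String text, List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] r : ranges) {
            sb.append(text, pos, r[0]);
            sb.append("<b>").append(text, r[0], r[1]).append("</b>");
            pos = r[1];
        }
        return sb.append(text.substring(pos)).toString();
    }

    public static void main(String[] args) {
        String text = "abcde xbcdy";
        // The query "bcd" analyzed as bi-grams gives "bc" and "cd".
        List<int[]> hits = matchBigrams(text, Arrays.asList("bc", "cd"));
        System.out.println(highlight(text, merge(hits)));
        // prints: a<b>bcd</b>e x<b>bcd</b>y
    }
}
```

      Without the merge step, the two hits [1,3) and [2,4) would produce nested or broken tags, which is the kind of problem refined term offsets are meant to avoid.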

      usage:

      TopDocs docs = searcher.search( query, 10 );
      Highlighter h = new Highlighter();
      FieldQuery fq = h.getFieldQuery( query );
      for( ScoreDoc scoreDoc : docs.scoreDocs ){
        // fieldName="content", fragCharSize=100, numFragments=3
        String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
        if( fragments != null ){
          for( String fragment : fragments )
            System.out.println( fragment );
        }
      }
      

      features:

      • fast for large docs
      • supports not only whitespace-based token streams but also "fixed-size" N-grams (e.g. (2,2), but not (1,3)); this can solve LUCENE-1489
      • supports PhraseQuery with phrase-unit highlighting, including slop
        q="w1 w2"
        <b>w1 w2</b>
        ---------------
        q="w1 w2"~1
        <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        
      • highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
      • the patch is easy to apply because it lives in an independent package (contrib/highlighter2)
      • uses Java 1.5
      • uses the query boost to score fragments (idf is not currently taken into account, but that should be possible)
      • pluggable FragListBuilder
      • pluggable FragmentsBuilder
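
      The phrase-unit highlighting listed above can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions, not the patch's algorithm: the class SlopPhraseSketch is hypothetical, it handles only ordered two-term phrases over whitespace tokens, and it treats slop as the number of tokens allowed between the two terms. Real sloppy phrase matching in Lucene is more general.

```java
// Hypothetical sketch; not part of the patch and not a Lucene class.
class SlopPhraseSketch {

    // Highlights occurrences of the ordered phrase "first second" where at
    // most `slop` other tokens sit between the two terms. Runs of adjacent
    // matched tokens share one <b>...</b> pair.
    static String highlight(String text, String first, String second, int slop) {
        String[] tokens = text.split(" ");
        boolean[] marked = new boolean[tokens.length];
        for (int p = 0; p < tokens.length; p++) {
            if (!tokens[p].equals(first)) {
                continue;
            }
            // Look for the nearest `second` within the allowed gap.
            for (int q = p + 1; q < tokens.length && q - p - 1 <= slop; q++) {
                if (tokens[q].equals(second)) {
                    marked[p] = true;
                    marked[q] = true;
                    break;
                }
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (marked[i] && (i == 0 || !marked[i - 1])) {
                sb.append("<b>"); // a run of matched tokens starts here
            }
            sb.append(tokens[i]);
            if (marked[i] && (i == tokens.length - 1 || !marked[i + 1])) {
                sb.append("</b>"); // the run ends here
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "w1 w3 w2 w3 w1 w2";
        System.out.println(highlight(text, "w1", "w2", 0));
        // prints: w1 w3 w2 w3 <b>w1 w2</b>
        System.out.println(highlight(text, "w1", "w2", 1));
        // prints: <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
    }
}
```

      With slop 1 the output reproduces the second example above. One simplification to be aware of: two separate matches that happen to be adjacent are rendered inside a single tag pair.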

      to do:

      • term positions may be unnecessary when phraseHighlight==false
      • collect performance numbers

      Attachments

      1. LUCENE-1522-fix-SIOOBE.patch
        3 kB
        Koji Sekiguchi
      2. LUCENE-1522-multiValued-test.patch
        5 kB
        Koji Sekiguchi
      3. LUCENE-1522.patch
        202 kB
        Koji Sekiguchi
      4. LUCENE-1522.patch
        197 kB
        Michael McCandless
      5. LUCENE-1522.patch
        152 kB
        Koji Sekiguchi
      6. LUCENE-1522.patch
        151 kB
        Koji Sekiguchi
      7. LUCENE-1522.patch
        146 kB
        Koji Sekiguchi
      8. colored-tag-sample.png
        28 kB
        Koji Sekiguchi
      9. LUCENE-1522.patch
        123 kB
        Koji Sekiguchi
      10. LUCENE-1522.patch
        121 kB
        Koji Sekiguchi

            People

            • Assignee:
              Mark Miller
              Reporter:
              Koji Sekiguchi
            • Votes:
              0
              Watchers:
              2