Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      I wrote this highlighter for my project to support bi-gram token streams. General token streams (e.g. WhitespaceTokenizer) are also supported; see the test code in the patch. The idea is inherited from a previous project with my colleague and from LUCENE-644. This approach requires the highlighted fields to be indexed with TermVector.WITH_POSITIONS_OFFSETS, but it is fast and supports N-grams. It depends on LUCENE-1448 to get refined term offsets.
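
To illustrate why stored offsets make this approach work for bi-grams, here is a minimal plain-Java sketch. It does not use the Lucene API or the classes in the patch: `BigramHighlightSketch`, `Token`, `bigrams` and `highlight` are hypothetical names, and the in-memory token list only stands in for what a term vector with positions and offsets would record. The point is that once each term carries its start and end offsets, a match can be tagged by slicing the original text, without re-analyzing the document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the patch's implementation.
class BigramHighlightSketch {

    /** A term plus the position and character offsets a term vector would store. */
    static final class Token {
        final String term;
        final int position;
        final int start;
        final int end;

        Token(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Fixed-size (2,2) N-grams: one overlapping two-character token per position. */
    static List<Token> bigrams(String text) {
        List<Token> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(new Token(text.substring(i, i + 2), i, i, i + 2));
        }
        return tokens;
    }

    /** Tags every run of consecutive bi-grams in the text that equals the query's bi-grams. */
    static String highlight(String text, String query) {
        List<Token> docTokens = bigrams(text);
        List<Token> queryTokens = bigrams(query);
        if (queryTokens.isEmpty()) {
            return text; // query shorter than one bi-gram: nothing to match
        }
        StringBuilder out = new StringBuilder();
        int copied = 0; // how much of the original text has been emitted so far
        for (int i = 0; i + queryTokens.size() <= docTokens.size(); i++) {
            boolean match = true;
            for (int j = 0; j < queryTokens.size(); j++) {
                if (!docTokens.get(i + j).term.equals(queryTokens.get(j).term)) {
                    match = false;
                    break;
                }
            }
            if (!match) {
                continue;
            }
            int start = docTokens.get(i).start;
            int end = docTokens.get(i + queryTokens.size() - 1).end;
            if (start < copied) {
                continue; // overlaps a span that was already tagged
            }
            out.append(text, copied, start)
               .append("<b>").append(text, start, end).append("</b>");
            copied = end;
        }
        out.append(text, copied, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <b>abc</b>def<b>abc</b>
        System.out.println(highlight("abcdefabc", "abc"));
    }
}
```

The query "abc" becomes the bi-grams "ab" and "bc"; the highlighted span runs from the start offset of the first matched bi-gram to the end offset of the last one.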

      usage:

      TopDocs docs = searcher.search( query, 10 );
      Highlighter h = new Highlighter();
      FieldQuery fq = h.getFieldQuery( query );
      for( ScoreDoc scoreDoc : docs.scoreDocs ){
        // fieldName="content", fragCharSize=100, numFragments=3
        String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
        if( fragments != null ){
          for( String fragment : fragments )
            System.out.println( fragment );
        }
      }
      

      features:

      • fast for large docs
      • supports not only whitespace-based token streams but also "fixed size" N-grams (e.g. (2,2), but not (1,3)); this can solve LUCENE-1489
      • supports PhraseQuery, with phrase-unit highlighting that respects slop
        q="w1 w2"
        <b>w1 w2</b>
        ---------------
        q="w1 w2"~1
        <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        
      • highlighted fields need to be indexed with TermVector.WITH_POSITIONS_OFFSETS
      • the patch is easy to apply because it lives in an independent package (contrib/highlighter2)
      • uses Java 1.5
      • uses the query boost to score fragments (idf is not considered currently, but it should be possible)
      • pluggable FragListBuilder
      • pluggable FragmentsBuilder
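
The phrase-unit highlighting with slop listed above can be sketched in plain Java as follows. This is only an illustration under simplifying assumptions, not the patch's algorithm and not Lucene's exact slop semantics: the phrase terms must appear in order, and "slop" is taken to be the number of extra token positions between the first and last matched term. `PhraseSlopSketch` and `highlight` are hypothetical names.

```java
// Hypothetical sketch, not the patch's implementation.
class PhraseSlopSketch {

    /** Tags in-order occurrences of the phrase whose extra gap positions do not exceed slop. */
    static String highlight(String text, String[] phrase, int slop) {
        if (phrase.length == 0) {
            return text;
        }
        String[] tokens = text.trim().split("\\s+"); // whitespace tokens, position = array index
        boolean[] hit = new boolean[tokens.length];

        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(phrase[0])) {
                continue;
            }
            int[] matched = new int[phrase.length];
            matched[0] = start;
            int pos = start;
            boolean found = true;
            // Take the nearest following occurrence of each later phrase term.
            for (int k = 1; k < phrase.length && found; k++) {
                int next = pos + 1;
                while (next < tokens.length && !tokens[next].equals(phrase[k])) {
                    next++;
                }
                if (next == tokens.length) {
                    found = false;
                } else {
                    matched[k] = next;
                    pos = next;
                }
            }
            if (!found) {
                continue;
            }
            int extra = (pos - start) - (phrase.length - 1);
            if (extra <= slop) {
                for (int m : matched) {
                    hit[m] = true;
                }
            }
        }

        // Render, merging adjacent matched tokens into a single tagged span.
        StringBuilder out = new StringBuilder();
        boolean open = false;
        for (int i = 0; i < tokens.length; i++) {
            if (hit[i] && open) {
                out.append(' ').append(tokens[i]);
            } else if (hit[i]) {
                if (i > 0) out.append(' ');
                out.append("<b>").append(tokens[i]);
                open = true;
            } else {
                if (open) {
                    out.append("</b>");
                    open = false;
                }
                if (i > 0) out.append(' ');
                out.append(tokens[i]);
            }
        }
        if (open) {
            out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "w1 w3 w2 w3 w1 w2";
        String[] phrase = {"w1", "w2"};
        // prints w1 w3 w2 w3 <b>w1 w2</b>
        System.out.println(highlight(text, phrase, 0));
        // prints <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        System.out.println(highlight(text, phrase, 1));
    }
}
```

With slop 1 the output reproduces the second example in the feature list: the separated "w1 ... w2" pair is tagged term by term, while the adjacent pair is tagged as one unit.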

      to do:

      • term positions may be unnecessary when phraseHighlight==false
      • collect performance numbers

        Attachments

        1. colored-tag-sample.png
          28 kB
          Koji Sekiguchi
        2. LUCENE-1522.patch
          202 kB
          Koji Sekiguchi
        3. LUCENE-1522.patch
          197 kB
          Michael McCandless
        4. LUCENE-1522.patch
          152 kB
          Koji Sekiguchi
        5. LUCENE-1522.patch
          151 kB
          Koji Sekiguchi
        6. LUCENE-1522.patch
          146 kB
          Koji Sekiguchi
        7. LUCENE-1522.patch
          123 kB
          Koji Sekiguchi
        8. LUCENE-1522.patch
          121 kB
          Koji Sekiguchi
        9. LUCENE-1522-fix-SIOOBE.patch
          3 kB
          Koji Sekiguchi
        10. LUCENE-1522-multiValued-test.patch
          5 kB
          Koji Sekiguchi

              People

              • Assignee:
                markrmiller@gmail.com Mark Miller
                Reporter:
                koji Koji Sekiguchi
              • Votes:
                0
                Watchers:
                2