Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      I've written this highlighter for my project to support bi-gram token streams (general token streams, e.g. WhitespaceTokenizer, are also supported; see the test code in the patch). The idea was inherited from my previous project with my colleague and from LUCENE-644. This approach requires the highlighted fields to be indexed with TermVector.WITH_POSITIONS_OFFSETS, but it is fast and can support N-grams. It depends on LUCENE-1448 to get refined term offsets.
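
      For reference, a minimal indexing sketch for a field this highlighter can use
      (the text and writer variables are assumed; only the TermVector setting matters here):

      Document doc = new Document();
      doc.add( new Field( "content", text, Field.Store.YES, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS ) );
      writer.addDocument( doc );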

      usage:

      TopDocs docs = searcher.search( query, 10 );
      Highlighter h = new Highlighter();
      FieldQuery fq = h.getFieldQuery( query );
      for( ScoreDoc scoreDoc : docs.scoreDocs ){
        // fieldName="content", fragCharSize=100, numFragments=3
        String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
        if( fragments != null ){
          for( String fragment : fragments )
            System.out.println( fragment );
        }
      }
      

      features:

      • fast for large docs
      • supports not only whitespace-based token streams but also "fixed size" N-grams (e.g. (2,2), not (1,3)), which can solve LUCENE-1489
      • supports PhraseQuery, phrase-unit highlighting with slops
        q="w1 w2"
        <b>w1 w2</b>
        ---------------
        q="w1 w2"~1
        <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        
      • highlighted fields need to be indexed with TermVector.WITH_POSITIONS_OFFSETS
      • the patch is easy to apply because it lives in an independent package (contrib/highlighter2)
      • uses Java 1.5
      • uses query boost when scoring fragments (currently ignores idf, but supporting it should be possible)
      • pluggable FragListBuilder
      • pluggable FragmentsBuilder (see the construction sketch after this list)
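
      A minimal construction sketch for the pluggable builders follows. The constructor
      shape and the SimpleFragListBuilder / SimpleFragmentsBuilder names are assumptions
      for illustration only; the exact API is defined by the attached patch.

      // hypothetical wiring of the pluggable pieces (names assumed)
      FragListBuilder fragListBuilder = new SimpleFragListBuilder();
      FragmentsBuilder fragmentsBuilder = new SimpleFragmentsBuilder();
      // phraseHighlight=true enables the phrase-unit highlighting described above
      Highlighter h = new Highlighter( true, fragListBuilder, fragmentsBuilder );
      FieldQuery fq = h.getFieldQuery( query );
      String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );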

      to do:

      • term positions may be unnecessary when phraseHighlight==false
      • collect performance numbers

      Attachments

      1. LUCENE-1522-fix-SIOOBE.patch (3 kB, Koji Sekiguchi)
      2. LUCENE-1522-multiValued-test.patch (5 kB, Koji Sekiguchi)
      3. LUCENE-1522.patch (202 kB, Koji Sekiguchi)
      4. LUCENE-1522.patch (197 kB, Michael McCandless)
      5. LUCENE-1522.patch (152 kB, Koji Sekiguchi)
      6. LUCENE-1522.patch (151 kB, Koji Sekiguchi)
      7. LUCENE-1522.patch (146 kB, Koji Sekiguchi)
      8. colored-tag-sample.png (28 kB, Koji Sekiguchi)
      9. LUCENE-1522.patch (123 kB, Koji Sekiguchi)
      10. LUCENE-1522.patch (121 kB, Koji Sekiguchi)

        Issue Links

          Activity

          Mark Miller added a comment -

          Thanks Koji!

          r804786

          Koji Sekiguchi added a comment -

          The patch includes the fix and a test case.

          Koji Sekiguchi added a comment -

          There is a bug in BaseFragmentsBuilder. When the highlighted field is not stored, a StringIndexOutOfBoundsException will be thrown. I'd like to reopen this issue so the fix can be included in 2.9. I'll post the fix soon.
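
          For illustration only, the kind of guard such a fix might add before building
          fragments (hypothetical sketch; the committed patch may differ):

          Document doc = reader.document( docId );
          String source = doc.get( fieldName );
          if( source == null ){
            // the field was not stored, so there is no text to build a fragment from
            return null;
          }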

          Michael McCandless added a comment -

          Thanks Koji, I'll commit this (turning on the tests). Always nice to add more tests!

          Koji Sekiguchi added a comment -

          Now that LUCENE-1448 and LUCENE-1759 have been committed, these tests should be uncommented. All tests pass.

          Michael McCandless added a comment -

          Thanks Koji!

          Michael McCandless added a comment -

          Patch looks good, Koji! I will commit shortly. Thanks for persisting here.

          Koji Sekiguchi added a comment -

          renamed Highlighter2 to FastVectorHighlighter

          Koji Sekiguchi added a comment -

          FastVectorHighlighter?

          +1 from me. I'll attach a renamed version of the patch tomorrow, since I have to be out soon and back late tonight.

          Mark Miller added a comment -

          Hmmm... is it still Highlighter2?

          Can we come up with a better name that kind of gives a feel for the difference between this and the other highlighter?

          FastVectorHighlighter?

          Well - maybe that will trigger a better idea from someone else anyway ...

          Michael McCandless added a comment -

          You should add highlighter2 also to the source dirs and the package prefixes for the combined all-javadocs in the main build.xml.

          Ahh right. I'll fix that!

          the main site docs also reference the contribs in developer-resources.xml.

          Whoa OK I'll update that as well! Thanks for the pointers

          Uwe Schindler added a comment -

          Mike: As I have done these types of site updates many times because of TrieRange:

          You should add highlighter2 also to the source dirs and the package prefixes for the combined all-javadocs in the main build.xml. And as far as I know, the main site docs also reference the contribs in developer-resources.xml.

          Everything else looks good!

          Michael McCandless added a comment -

          Patch looks good, thanks Koji! All tests now pass for me.

          I attached a new patch that updates the website w/ references to highlighter2. I plan to commit in a day or two.

          Koji Sekiguchi added a comment -

          Thank you for your advice, Michael.

          because they test multi-valued fields

          Right.

          I commented out the parts that depend on LUCENE-1448, i.e., the test cases for multi-valued fields:

          /*
           * ----------------------------------
           *  THIS TEST DEPENDS ON LUCENE-1448
           *  UNCOMMENT WHEN IT IS COMMITTED.
           * ----------------------------------
            public void testXXXXXXXMVB() throws Exception {
                :
            }
          */
          

          All tests pass without LUCENE-1448.

          Michael McCandless added a comment -

          Is it possible to decouple this issue from LUCENE-1448 (whose latest patch is quite old, no longer applies, and may not make it into 2.9)? I.e., is it only the unit tests that rely on this fix to pass (because they test multi-valued fields)?

          Koji Sekiguchi added a comment -

          I added package.html to explain the algorithm of H2.

          Koji Sekiguchi added a comment -

          Added more comments, index-time synonym support, and its test cases.

          Michael McCandless added a comment -

          I think this is an unrealistic requirement in some cases (e.g. AND queries).

          I agree.

          Michael Busch added a comment -

          (Meaning, if you were to copy & paste the full excerpt you are looking at, index it as a document, would your current search match it).

          I think this is an unrealistic requirement in some cases (e.g. AND queries). I agree it makes sense for phrases to show them entirely in a fragment (even if that means not to show the beginning of a sentence). But often you have only one or two lines of text to display an extract. Then it might be a better choice to show two decently sized fragments with some context around the highlighted terms, rather than showing e.g. 4 short fragments just to show all 4 highlighted query terms (e.g. for query '+a +b +c +d')

          Mark Miller added a comment -

          But that's really quite a serious problem; it's the kind that
          immediately erodes user's trust. Though if this user had used
          SpanScorer it would have been fixed (right?).

          Right - my point was more that it was a common complaint and has been solved in one way or another for a long time. Even back when that post occurred, there was a JIRA highlighter that worked with phrase queries, I think. There have been at least one or two besides the SpanScorer.

          Is there any reason not to use SpanScorer (vs QueryScorer)?

          It is slower when working with position-sensitive clauses - because it actually does some work. For non-position-sensitive terms, it's the same speed as the standard scorer. It makes sense to me to always use it, but if you don't care and want every term highlighted, why pay the price, I guess...

          Well... I'd still like to explore some way to better integrate w/ core
          (just don't have enough time, but maybe if I keep talking about it
          here, someone else will get the itch + time).

          Right - don't get me wrong - I was just getting the thoughts in my head down. These types of brain dumps you higher-level guys do definitely lead to work getting done - the SpanScorer came directly from these types of discussions, and quite a bit later - the original discussion happened before my time.

          Well this is open source after all. Things get "naturally
          prioritized".

          A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters.

          Yeah also another common theme in open-source development, though it's
          in good company: evolution and capitalism share the same "flaw".

          Right. I suppose I was just suggesting that something more practical might make more sense (more musing than suggesting). And practical in terms of how much activity we have seen in the highlighter area (fairly low, and not usually to the extent needed to get something committed and in use).

          And the split work on the highlighters is fine - but if we had the right highlighter base, more work could have been concentrated on the highlighter that's most used. Not really a complaint, but an idea for the future. If we can get something better going, perhaps we can get to the point where people work with the current implementation rather than creating a new one every time.

          Michael McCandless added a comment -

          I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly

          But that's really quite a serious problem; it's the kind that
          immediately erodes user's trust. Though if this user had used
          SpanScorer it would have been fixed (right?).

          Is there any reason not to use SpanScorer (vs QueryScorer)?

          The "final inch" (search UI) is exceptionally important!

          When users see the PhraseQuery look right, I haven't seen any other repeated complaints really.

          OK.

          And I think we have positional solved fairly well with the current API - it's just too darn slow.

          Well... I'd still like to explore some way to better integrate w/ core
          (just don't have enough time, but maybe if I keep talking about it
          here, someone else will get the itch + time).

          I think an IndexReader impl around loaded TermVectors can get us OK
          performance (no re-analysis nor linear scan of resynthesized
          TokenStream).

          Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool).

          Well this is open source after all. Things get "naturally
          prioritized".

          A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters.

          Yeah also another common theme in open-source development, though it's
          in good company: evolution and capitalism share the same "flaw".

          Mark Miller added a comment (edited) -

          I think you are reading more into that than I see - that guy is just frustrated that PhraseQueries don't highlight correctly. That was/is a common occurrence and you can find tons of examples. There are one or two JIRA highlighters that address it, and there is the Span highlighter (more interestingly, there is a link to the birth of the Span highlighter idea on that page - thanks M. Harwood).

          When users see the PhraseQuery look right, I haven't seen any other repeated complaints really. While it would be nice to match boolean logic fully, I almost don't think it's worth the effort. You likely have an interest in those terms anyway - it's not a given that the terms that caused the match (non-positional) matter. I have not seen a complaint on that one - mostly just positional-type stuff. And I think we have positional solved fairly well with the current API - it's just too darn slow. Not that I am against things being sweet and perfect, and getting exact matches, but there has been lots of talk in the past about integrating the highlighter into core and making things really fast and efficient - and generally it comes down to what work actually gets done (and all this stuff ends up at the hard end of the pool).

          When I wrote the SpanScorer, many times it was discussed how things should really be done. Most methods involved working with core - but what has been there for a couple years now is the SpanScorer that plugs into the current highlighter API and nothing else has made any progress. Not really an argument, just kind of thinking out loud at this point...

          I'm all for improving the speed and accuracy of the highlighter at the end of the day, but it's a tall order considering how much attention the Highlighter has managed to receive in the past. It's large on ideas and low on sweat.

          edit
          A lot of the sweat that is given has been fragmented by the 3 or 4 alternate highlighters.

          Michael McCandless added a comment -

          Randomly searching in Google I came across this:

          http://stackoverflow.com/questions/82151/is-there-a-fast-accurate-highlighter-for-lucene

          ...which emphasizes how important it is that the highlighter only highlight "matching" fragdocs when possible.

          (Meaning, if you were to copy & paste the full excerpt you are looking at, index it as a document, would your current search match it).

          David Kaelbling added a comment -

          Hi,

          Our application wants to find and highlight all the hits in a document,
          not just the best one(s). If future highlighters still allowed this,
          even if only by judicious use of subclasses, I would be happy.

          Thanks,
          David



          Marvin Humphrey added a comment -

          > OK, it sounds like one can simply use different models to score
          > fragdocs and it's still an open debate how much each of these criteria
          > (IDF, showing surround context, being on sentence boundary, diversity
          > of terms) should impact the score.

          With Michael Busch's priority queue approach, the algorithm for choosing the
          fragments can be abstracted into the class of object we put in the queue and
          its lessThan() method. The output from the queue just has to be something the
          Highlighter can chew.

          Michael McCandless added a comment -

          Something like that. An array of span scores is too limited; a full fledged
          class would do better. Designing that class requires striking a balance
          between what information we think is useful and what information Highlighter
          can sanely reduce.

          Agreed, and I'm not sure about the tree structure (just floating
          ideas...). It could very well be overkill.

          By proposing the tree structure, you're suggesting that
          Highlighter will reverse engineer boolean matching; that sounds like a lot of
          work to me.

          It wouldn't be reverse engineered: BooleanQuery/Weight/Scorer2 itself
          will have returned that. I.e., we would add a method like
          "getSpanTree()".

          Still, the KS highlighter probably wouldn't do what you describe. The proximity
          boosting accelerates as the spans approach each other, and maxes out if
          they're adjacent. So "bush bush" might be preferred over "president bush",
          but "bush or bush" probably wouldn't.

          OK, it sounds like one can simply use different models to score
          fragdocs and it's still an open debate how much each of these criteria
          (IDF, showing surround context, being on sentence boundary, diversity
          of terms) should impact the score. I agree, the "basic litmus test" I
          proposed is too strong.

          Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

          That's not necessarily a property of the Highlighter, just the basic
          implementations we currently supply for the pluggable classes. You can
          supply a custom fragmenter and you can control the number of
          fragments.

          I agree: H1 is very pluggable and one could plug in a better
          fragmenter, but we don't offer such an impl in H1, and this is a case
          where "out-of-the-box defaults" are very important.

          Mark Miller added a comment -

          Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence boundaries.

          That's not necessarily a property of the Highlighter, just the basic implementations we currently supply for the pluggable classes. You can supply a custom fragmenter and you can control the number of fragments.
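
          For illustration, a custom fragmenter written against the 2.4-era Fragmenter
          interface (start(String) / isNewFragment(Token)) could look like the sketch
          below; this is an assumed example, not code from this issue:

          public class SentenceFragmenter implements org.apache.lucene.search.highlight.Fragmenter {
            private String text;
            private int fragmentStart;

            public void start( String originalText ){
              text = originalText;
              fragmentStart = 0;
            }

            public boolean isNewFragment( org.apache.lucene.analysis.Token token ){
              // start a new fragment when a sentence terminator appears between
              // the current fragment start and this token
              for( int i = token.startOffset() - 1; i > fragmentStart; i-- ){
                char c = text.charAt( i );
                if( c == '.' || c == '!' || c == '?' ){
                  fragmentStart = token.startOffset();
                  return true;
                }
              }
              return false;
            }
          }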

          Marvin Humphrey added a comment -

          > I think we may need a tree-structured result returned by the
          > Weight/Scorer, compactly representing the "space" of valid fragdocs
          > for this one doc. And then somehow we walk that tree,
          > enumerating/scoring individual "valid" fragdocs that are created from
          > that tree.

          Something like that. An array of span scores is too limited; a full fledged
          class would do better. Designing that class requires striking a balance
          between what information we think is useful and what information Highlighter
          can sanely reduce. By proposing the tree structure, you're suggesting that
          Highlighter will reverse engineer boolean matching; that sounds like a lot of
          work to me.

          >> However, note that IDF would prevent a bunch of hits on "the" from causing too
          >> hot a hotspot in the heat map. So you're likely to see fragments with high
          >> discriminatory value.
          >
          > This still seems subjectively wrong to me. If I search for "president
          > bush", probably bush is the rarer term and so you would favor showing
          > me a single fragment that had bush occur twice, over a fragment that
          > had a single occurrence of president and bush?

          We've ended up in a false dichotomy. Favoring high IDF terms – or more
          accurately, high scoring character position spans – and favoring fragments
          with high term diversity are not mutually exclusive.

          Still, the KS highlighter probably wouldn't do what you describe. The proximity
          boosting accelerates as the spans approach each other, and maxes out if
          they're adjacent. So "bush bush" might be preferred over "president bush",
          but "bush or bush" probably wouldn't.

          I don't think that there's anything wrong with preferring high term diversity;
          the KS highlighter doesn't happen to support favoring fragments with high term
          diversity now, but would be improved by adding that capability. I just don't
          think term diversity is so important that it qualifies as a "base litmus
          test".

          There are other ways of choosing good fragments, and IDF is one of them. If
          you want to show why a doc matched a query, it makes sense to show the section
          of the document that contributed most to the score, surrounded by a little
          context.

          > Which excerpts don't scan easily right now? Google's, KS's, Lucene's
          > H1 or H2?

          Lucene H1. Too many ellipses, and fragments don't prefer to start on sentence
          boundaries.

          I have to qualify the assertion that the fragments don't scan well with the caveat
          that I'm basing this on a personal impression. However, I'm pretty confident
          about that impression. I would be stunned if there were not studies out there
          demonstrating that sentence fragments which begin at the top are easier to
          consume than sentence fragments which begin in the middle.

          Michael McCandless added a comment -

          OK to sum up here with observations / wish list / ideas /
          controversies / etc. for Lucene's future merged highlighter:

          • Fragmenter should aim for fast "eye + brain scanning
            consumability" (eg, try hard to start on sentence boundaries,
            include context)
          • Let's try for single source – each Query/Weight/Scorer should be
            able to enumerate the set of term positions/spans that caused it
            to match a specific doc (like explain(), but provides
            positions/spans detailing the match). Trying to "reverse
            engineer" the matching is brittle
          • Sliding window is better than static "top down" fragmentation
          • To scale, we should make a simple IndexReader impl on top of term
            vectors, but still allow the "re-index single doc on the fly"
            option
          • Favoring breadth (more unique terms instead of many occurrences of
            certain terms) seems important, except for too-many-term queries
            where this gets unwieldy
          • Prefer a single fragment if it scores well enough, but fall back
            to several, if necessary, to show "breadth"
          • Produce structured output so non-HTML front ends (eg Flex) can
            render
          • Try to include "context around the hits", when possible (eg the
            "favor middle of the sentence" that Michael described)
          • Maybe or maybe don't let IDF affect fragment scoring
          • Performance is important – use TermVectors if present, add early
            termination if you've already found a good enough fragdoc, etc.
          • Maybe a tree-based fragdoc enumeration / searching model; I think
            this'd be even more efficient than sliding window, especially for
            large docs
          • Multi-color, HeatMap default ootb HTML UIs are nice
          • It's all very subjective and quite a good challenge!!

          In the meantime, it seems like we should commit this H2 and give users
          the choice? We can then iterate over time on our wish list....

          Michael McCandless added a comment -

          >> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
          >> spans produced by their children.
          >
          > Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
          > fact that you should try to include a match from each of the sub-clauses' spans.

          A good idea. ANDQuery's highlightSpans() method could probably be improved by
          post-processing the child spans to take this into account. That way we
          wouldn't have to gum up the main Highlighter code with a bunch of conditionals
          which afford special treatment to certain query types.

          I think we may need a tree-structured result returned by the
          Weight/Scorer, compactly representing the "space" of valid fragdocs
          for this one doc. And then somehow we walk that tree,
          enumerating/scoring individual "valid" fragdocs that are created from
          that tree.

          > What I meant was: all other things being equal, do you more strongly
          > favor a fragment that has all N of the terms in a query vs another
          > fragment that has fewer than N but say higher net number of occurrences.

          No, the diversity of the terms in a fragment isn't factored in. The span
          objects only tell the Highlighter that a particular range of characters
          was important; they don't say why.

          However, note that IDF would prevent a bunch of hits on "the" from causing too
          hot a hotspot in the heat map. So you're likely to see fragments with high
          discriminatory value.

          This still seems subjectively wrong to me. If I search for "president
          bush", probably bush is the rarer term and so you would favor showing
          me a single fragment that had bush occur twice, over a fragment that
          had a single occurrence of president and bush?

          > Google picks more than one fragment; it seems like it picks one or two
          > fragments.

          I probably overstated my opposition to supplying an excerpt containing more
          than one fragment. It seems OK to me to select more than one, so long as they
          all scan easily, and so long as the excerpts don't get long enough to force
          excessive scrolling and slow down the time it takes the user to scan the whole
          results page.

          What bothers me is that the excerpts don't scan easily right now. I consider
          that a much more important defect than the fact that the fragdoc doesn't hit
          every term (which isn't even possible for large queries), and it seemed to me
          that pursuing exhaustive term matching was likely to yield even more highly
          fragmented, visually chaotic fragdocs.

          Which excerpts don't scan easily right now? Google's, KS's, Lucene's
          H1 or H2?

          I think with a tree structure representing the search space for all
          fragdocs, we could then efficiently enumerate fragdocs with an
          appropriate scoring model (favoring sentence starts or surrounding
          context, breadth of terms, etc.). This way we can do a real search
          (on all fragdocs) subject to the preference for
          consumability/breadth.

          Marvin Humphrey added a comment -

          >> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
          >> spans produced by their children.
          >
          > Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
          > fact that you should try to include a match from each of the sub-clauses' spans.

          A good idea. ANDQuery's highlightSpans() method could probably be improved by
          post-processing the child spans to take this into account. That way we
          wouldn't have to gum up the main Highlighter code with a bunch of conditionals
          which afford special treatment to certain query types.

          > What I meant was: all other things being equal, do you more strongly
          > favor a fragment that has all N of the terms in a query vs another
          > fragment that has fewer than N but say higher net number of occurrences.

          No, the diversity of the terms in a fragment isn't factored in. The span
          objects only tell the Highlighter that a particular range of characters
          was important; they don't say why.

          However, note that IDF would prevent a bunch of hits on "the" from causing too
          hot a hotspot in the heat map. So you're likely to see fragments with high
          discriminatory value.

          > Google picks more than one fragment; it seems like it picks one or two
          > fragments.

          I probably overstated my opposition to supplying an excerpt containing more
          than one fragment. It seems OK to me to select more than one, so long as they
          all scan easily, and so long as the excerpts don't get long enough to force
          excessive scrolling and slow down the time it takes the user to scan the whole
          results page.

          What bothers me is that the excerpts don't scan easily right now. I consider
          that a much more important defect than the fact that the fragdoc doesn't hit
          every term (which isn't even possible for large queries), and it seemed to me
          that pursuing exhaustive term matching was likely to yield even more highly
          fragmented, visually chaotic fragdocs.

          Michael Busch added a comment (edited) -

          I wrote the highlighter for the OmniFind Yahoo Edition a few years ago
          and I totally agree that all this stuff is very subjective.

          The OYE highlighter is of course based on Lucene and uses a sliding
          window too. It also uses information about sentence boundaries and
          prefers fragments that start at the beginning of a sentence.

          So it goes through the document and generates fragment candidates on
          the fly. It calculates a score for each fragment and puts it into a
          priority queue. The score is calculated using different heuristics:

          • fragments are boosted that start at the beginning of a sentence
          • the more highlighted terms a fragment contains, the higher it is
            scored
          • more different highlighted terms score higher than a lot of
            occurrences of the same term
          • no tf-idf is used
          • if a fragment does not start at the beginning of a sentence, then it
            is scored higher if the highlighted term(s) occur(s) more in the middle
            of the fragment: e.g. 'a b c d e' scores lower than 'b c a d e' if 'a'
            is the highlighted term; this is being done to show as much context as
            possible around a highlighted term
          • only a single long fragment is created if it contains all query terms
            (like google)
          • The queue tries to gather fragments so that the union of the fragments
            contains as many different query terms as possible. So it might toss a
            fragment in favor of one with a lower score, if it increases the
            total number of different highlighted terms.
          • For performance reasons there is an early termination if the
            fragments in the queue contain all query terms.

          Initially this highlighter also imitated Lucene's behavior to find the
          highlighted positions. Last year I changed it to use SpanQueries. With
          our flexible query parser (which I introduced on java-dev recently) we
          have two different QueryBuilders. One creates the "normal" query, that
          is executed to find the matching docs. Then a different QueryBuilder
          creates SpanQueries from the same query for the highlighter.

          The output of the highlighter is not formatted html, but rather an
          object containing the unformatted text, together with offset
          information for both fragments and highlights. These offset spans can
          carry additional information, which can be used for multi-color
          highlighting too. We then use an HTMLFormatter class to generate the
          formatted text; an XMLFormatter that keeps the offset information
          separate from the actual text is also possible (we're currently working on
          such an XMLFormatter). This is useful for frontends written in e.g. Flex.

          The performance of our highlighter is good and so far we have been
          pretty happy with the quality of the excerpts, but there is still much
          room for improvements.

          I'd be happy to help working on a new highlighter. I think this is a
          very important component, and Lucene's core should have a very good
          and flexible one.
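
          As a rough illustration of the priority-queue approach described above (the
          FragmentCandidate class, its fields, and the weights are assumptions, not the
          OYE implementation):

          // keep the top-N fragment candidates, ordered by a heuristic score
          class FragmentCandidate {
            int startOffset, endOffset;
            boolean startsSentence;     // boosted if the fragment starts a sentence
            int uniqueQueryTerms;       // term diversity is favored
            int totalHighlightedTerms;

            float score(){
              float s = totalHighlightedTerms + 2.0f * uniqueQueryTerms;
              if( startsSentence ) s *= 1.5f;
              return s;
            }
          }

          // the lowest-scoring candidate sits at the head and is evicted first
          java.util.PriorityQueue<FragmentCandidate> queue =
              new java.util.PriorityQueue<FragmentCandidate>( 10,
                  new java.util.Comparator<FragmentCandidate>() {
                    public int compare( FragmentCandidate a, FragmentCandidate b ){
                      return Float.compare( a.score(), b.score() );
                    }
                  } );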

          Mark Miller added a comment -

          But, flattening could produce just a and d (I think?); and I think H1
          could do the same even with SpanScorer (Mark is that true? I don't
          fully understand the Query -> SpanQuery conversion).

          Right - SpanScorer won't follow boolean logic - it will just break down each clause and not highlight a NOT - similar to standard H1. If a particular clause is position sensitive, it will only be lit if it's found in a valid position, but that's as deep as it goes.

          Michael McCandless added a comment -

          > Do you require term vectors to be stored, for highlighting (cannot
          > re-analyze the text)?

          Yes, but that's not fundamental to the design. You just have to hand the
          Weight some sort of single-doc index that includes sufficient data to
          determine what parts of the text contributed to the hit and how much they
          contributed. The Weight needn't care whether that single-doc index was
          created on the fly or stored at index time.

          OK.

          > For queries that normally do not use positions at all (simple AND/OR
          > of terms), how does your highlightSpans() work?

          ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
          spans produced by their children.

          Hmm – it seems like that loses information. Ie, for ANDQuery, you
          lose the fact that you should try to include a match from each of the
          sub-clauses' spans.

          > For BooleanQuery, is coord factor used to favor fragment sets that
          > include more unique terms?

          No; I don't think that would be fine grained enough to help.

          What I meant was: all other things being equal, do you more strongly
          favor a fragment that has all N of the terms in a query vs. another
          fragment that has fewer than N but, say, a higher net number of
          occurrences?

          There's a HeatMap class that performs additional weighting. Spans that
          cluster together tightly (i.e. that could fit together within the excerpt) are
          boosted.

          That sounds great.

          > Are you guaranteed to always present a net set of fragments that
          > "matches" the query? (eg the example query above).

          No. The KS version supplies a single fragment. It naturally prefers
          fragments with rarer terms, because the span scores are multiplied by the
          Weight's weighting factor (which includes IDF).

          Hmm OK.

          Once that fragment is selected, the KS highlighter worries a lot about
          trimming to sensible sentence boundaries.

          I totally agree: easy/fast consumability is very important, so
          choosing entire sentences, or at least anchoring the start or maybe
          end on a sentence boundary, is important. Lucene's H1 doesn't do this
          ootb today I think (though you could plug in your own fragmenter).

          In my own subjective judgment, supplying a single maximally coherent fragment
          which prefers clusters of rare terms results in an excerpt which "scans" as
          quickly as possible, conveying the gist of the content with minimal "visual
          effort". I used Google's excerpting as a model.

          Google picks more than one fragment; it seems like it picks one or two
          fragments.

          I'm torn on whether IDF should really come into play though...

          > I think the base litmus test for a highlighter is: if one were to
          > take all fragments presented for a document (call this a "fragdoc")
          > and make a new document from it, would that document match the
          > original query?

          Without the aid of formal studies to guide us, this is a subjective call.
          FWIW, I disagree. In my view, visual scanning speed and coherence
          are more important than completeness.

          I'm not a big fan of the multi-fragment approach, because I think it takes too
          much effort to grok each individual entry. Furthermore, the fact that the
          fragments don't start on sentence boundaries (whenever feasible) adds to the
          visual effort needed to orient yourself.

          Search results contain a lot of junk. The user needs to be able to parse the
          results page as quickly as possible and refine their search query as needed.
          Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
          whole" impede that. Trees vs. Forest.

          Again, that's my own aesthetic judgment, but I'll wager that there are studies
          out there showing that fragments which start at the top of a sentence are
          easier to consume, and I think that's important.

          I agree, it's not cut and dried here; this is all quite subjective.

          I think one case that's tricky is two terms that do not tend to
          co-occur in proximity. Eg search for python greenspan on Google, and
          most of the fragdocs consist of two fragments, one for each term. Ie
          Google is trying to include all the terms in the fragdoc (my "coord
          factor" question above).

          > In fact, I think the perfect highlighter would "logically" work as
          > follows: take a single document and enumerate every single possible
          > fragdoc.

          KS uses a sliding window rather than chunking up the text into fragdocs of
          fixed length.

          Or, the allowed length of each fragment could span a specified min/max
          range.

          And I like the sliding window approach instead of the pre-fragment
          approach.

          (Note: a fragdoc is one or more fragments stuck together, ie, the
          entire excerpt.)

          Michael McCandless added a comment -

          PS: Obviously, refinements of the highlighting algo will help Lucy, too. I
          don't suppose you want to continue this on the Lucy dev list so that Lucy
          banks some community credit for this discussion. :\

          Well... remember that more discussions between you and me and Nathan on
          Lucy-dev (as much as I love having them) don't really "count" as a
          "bigger" community. In other words, like the scoring of a
          BooleanQuery, there is a very strong coord factor at play when
          measuring "community". If you and I and Nathan have fewer
          conversations on Lucy-dev, but then two other new people join in, that
          is a much stronger community.

          So, maybe send a note to lucy-dev, referencing this as a relevant
          discussion to Lucy's approach to highlighting... and leave a
          tantalizing invite here for others to make the jump to lucy-dev.
          Growing a community is not easy!

          Marvin Humphrey added a comment -

          > Do you require term vectors to be stored, for highlighting (cannot
          > re-analyze the text)?

          Yes, but that's not fundamental to the design. You just have to hand the
          Weight some sort of single-doc index that includes sufficient data to
          determine what parts of the text contributed to the hit and how much they
          contributed. The Weight needn't care whether that single-doc index was
          created on the fly or stored at index time.

          > For queries that normally do not use positions at all (simple AND/OR
          > of terms), how does your highlightSpans() work?

          ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
          spans produced by their children.

          > For BooleanQuery, is coord factor used to favor fragment sets that
          > include more unique terms?

          No; I don't think that would be fine grained enough to help.

          There's a HeatMap class that performs additional weighting. Spans that
          cluster together tightly (i.e. that could fit together within the excerpt) are
          boosted.
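
          One way such a clustering boost could work, purely as a rough sketch (this is
          not the HeatMap code; the 'spans' array with start/score fields and
          'excerptLength' are assumptions for illustration):

          // Boost each span by the scores of other spans close enough to share an excerpt with it.
          float[] boosted = new float[spans.length];
          for (int i = 0; i < spans.length; i++) {
            float boost = 1.0f;
            for (int j = 0; j < spans.length; j++) {
              if (i != j && Math.abs(spans[j].start - spans[i].start) < excerptLength) {
                boost += spans[j].score; // nearby spans reinforce each other
              }
            }
            boosted[i] = spans[i].score * boost;
          }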

          > Are you guaranteed to always present a net set of fragments that
          > "matches" the query? (eg the example query above).

          No. The KS version supplies a single fragment. It naturally prefers
          fragments with rarer terms, because the span scores are multiplied by the
          Weight's weighting factor (which includes IDF).

          Once that fragment is selected, the KS highlighter worries a lot about
          trimming to sensible sentence boundaries.

          In my own subjective judgment, supplying a single maximally coherent fragment
          which prefers clusters of rare terms results in an excerpt which "scans" as
          quickly as possible, conveying the gist of the content with minimal "visual
          effort". I used Google's excerpting as a model.

          > I think the base litmus test for a highlighter is: if one were to
          > take all fragments presented for a document (call this a "fragdoc")
          > and make a new document from it, would that document match the
          > original query?

          Without the aid of formal studies to guide us, this is a subjective call.
          FWIW, I disagree. In my view, visual scanning speed and coherence
          are more important than completeness.

          I'm not a big fan of the multi-fragment approach, because I think it takes too
          much effort to grok each individual entry. Furthermore, the fact that the
          fragments don't start on sentence boundaries (whenever feasible) adds to the
          visual effort needed to orient yourself.

          Search results contain a lot of junk. The user needs to be able to parse the
          results page as quickly as possible and refine their search query as needed.
          Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
          whole" impede that. Trees vs. Forest.

          Again, that's my own aesthetic judgment, but I'll wager that there are studies
          out there showing that fragments which start at the top of a sentence are
          easier to consume, and I think that's important.

          > In fact, I think the perfect highlighter would "logically" work as
          > follows: take a single document and enumerate every single possible
          > fragdoc.

          KS uses a sliding window rather than chunking up the text into fragdocs of
          fixed length.

          > Having a whole separate package trying to reverse-engineer where matches had
          > taken place between Query and Document is hard to get right.

          Exactly.

          PS: Obviously, refinements of the highlighting algo will help Lucy, too. I
          don't suppose you want to continue this on the Lucy dev list so that Lucy
          banks some community credit for this discussion. :\

          Michael McCandless added a comment -

          > It'd be sort of like a positional-aware "explain", ie "show me the term
          > occurrences that allowed the full query to accept this document".

          FWIW, this is more or less how the KinoSearch highlighter now works in svn
          trunk. It doesn't use a Scorer, though, but instead the KS analogue to
          Lucene's "Weight" class.

          The (Weight) is fed what is essentially a single doc index, using stored term
          vectors. Weight.highlightSpans() returns an array of "span" objects, each of
          which has a start offset, a length, and a score. The Highlighter then
          processes these span objects to create a "heat map" and choose its excerpt
          points.

          The idea is that by delegating responsibility for creating the scoring spans, we
          make it easier to support arbitrary Query implementations with a single
          Highlighter class.

          Awesome!

          Do you require term vectors to be stored, for highlighting (cannot
          re-analyze the text)?

          For queries that normally do not use positions at all (simple AND/OR
          of terms), how does your highlightSpans() work?

          For BooleanQuery, is coord factor used to favor fragment sets that
          include more unique terms?

          Are you guaranteed to always present a net set of fragments that
          "matches" the query? (eg the example query above).

          I think the base litmus test for a highlighter is: if one were to
          take all fragments presented for a document (call this a "fragdoc")
          and make a new document from it, would that document match the
          original query?

          In fact, I think the perfect highlighter would "logically" work as
          follows: take a single document and enumerate every single possible
          fragdoc. Each fragdoc is allowed to have maxNumFragments fragments,
          where each fragment has a min/max number of characters. The set of
          fragdocs is of course ridiculously immense.

          Take this massive collection of fragdocs and build a new temporary
          index, then run your Query against that index. Many of the fragdocs
          would not match the Query, so they are eliminated right off (this is
          the litmus test). Then, of the ones that do, you want the highest
          scoring fragdocs.

          Obviously you can't actually implement a highlighter like that, but I
          think "logically" that is the optimal highlighter that we are trying
          to emulate with more efficient implementations.
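
          In rough Java-ish pseudocode (purely conceptual -- FragDoc,
          enumerateAllPossibleFragDocs and indexSingleDoc are made-up names, and none
          of this is meant to actually run):

          List<FragDoc> matching = new ArrayList<FragDoc>();
          for (FragDoc fd : enumerateAllPossibleFragDocs(doc, maxNumFragments, minChars, maxChars)) {
            IndexSearcher tmp = indexSingleDoc(fd);            // tiny index over just this fragdoc
            if (tmp.search(query, 1).totalHits == 0) continue; // litmus test: the fragdoc must still match
            matching.add(fd);                                  // keep it, scored against the query
          }
          // sort 'matching' by score and present the top fragdoc(s) as the excerpt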

          I think having the Query/Weight/Scorer class be the single-source for
          hits, explanation & highlight spans is the right approach. Having a
          whole separate package trying to reverse-engineer where matches had
          taken place between Query and Document is hard to get right. EG
          BooleanScorer2's coord factor would naturally/correctly influence the
          selection.

          I also think building a [reduced, just Postings] IndexReader API on top of
          TermVectors ought to be a simple way to get great performance here.

          Marvin Humphrey added a comment -

          > It'd be sort of like a positional-aware "explain", ie "show me the term
          > occurrences that allowed the full query to accept this document".

          FWIW, this is more or less how the KinoSearch highlighter now works in svn
          trunk. It doesn't use a Scorer, though, but instead the KS analogue to
          Lucene's "Weight" class.

          The (Weight) is fed what is essentially a single doc index, using stored term
          vectors. Weight.highlightSpans() returns an array of "span" objects, each of
          which has a start offset, a length, and a score. The Highlighter then
          processes these span objects to create a "heat map" and choose its excerpt
          points.

          The idea is that by delegating responsibility for creating the scoring spans, we
          make it easier to support arbitrary Query implementations with a single
          Highlighter class.

          Mark Miller added a comment - - edited

          Is the reason why H1 creates the full token stream (even when
          TermVectors is the source) in order to build the MemoryIndex?

          If term vectors (w/ positions, offsets) were stored, wouldn't it be
          possible to make a simple index (or at least TermDocs, TermPositions)
          wrapped on those TermVectors?

          It creates the full tokenstream because it was designed to work without termvectors, and so without offset info for the query terms, it rebuilds the stream and processes a token at a time - the API gives you hooks to highlight at any of these tokens - that's essentially the bottleneck I think - taking everything a token at a time, but the whole API is based on that fact. With the SpanScorer version, we can get almost any info from the MemoryIndex, but it was convenient to fit into the current highlighter API to start. I had it in my mind to break from the API and make a large-doc highlighter that didn't need termvectors, but I found the MemoryIndex and getSpans to still be too slow in my initial testing. I'd hoped to work more on it, but haven't had a chance. So essentially, while more can be done with termvectors, the improvements break the current API at a pretty deep level - no one has done the work to solve that, I guess - which is why we have the alternate highlighters.

          edit

          I suppose one of the main problems with the large doc approach I briefly tested is that it still requires rebuilding the tokenstream (and I was attempting not to use termvectors either). Avoiding the need for that would probably make it much more competitive.

          Michael McCandless added a comment -

          Not sure it solves being able to get offsets from the query terms and still mask for positions though

          Can you explain that more?

          Michael McCandless added a comment -

          But, if you have for example a document 'a b c a b c' and the query
          'a AND b', then this approach would only highlight the first two terms,
          no?

          Ahh right – in fact, nothing would be highlighted because the scorer
          for AND queries doesn't visit positions at all (it doesn't need to).

          I guess we'd have to ask such scorers to forcefully visit positions &
          enumerate all matches within one doc, when running in "highlight"
          mode. Hmm, feeling like a big change...

          But maybe it could work. It'd be sort of like a positional-aware
          "explain", ie "show me the term occurrences that allowed the full
          query to accept this document".

          Imagine query "(a AND b) OR (c AND d)". When looking at the fragments
          for each doc, I would want to see both a AND b, or both c AND d, but
          never (for example) just a and d.

          But, flattening could produce just a and d (I think?); and I think H1
          could do the same even with SpanScorer (Mark is that true? I don't
          fully understand the Query -> SpanQuery conversion).

          Whereas if we could ask for positions of the "real" matches I think it
          would work correctly?

          Michael McCandless added a comment -

          I'd also like it to work if you don't have termvectors stored (though
          be faster if they are perhaps, as it is now).

          I agree.

          Getting hit positions for position sensitive clauses requires
          converting the query to a span query and calling getSpans on a memory
          index

          Is the reason why H1 creates the full token stream (even when
          TermVectors is the source) in order to build the MemoryIndex?

          If term vectors (w/ positions, offsets) were stored, wouldn't it be
          possible to make a simple index (or at least TermDocs, TermPositions)
          wrapped on those TermVectors?
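
          Eg something like this (just sketching the idea, not tested; the reader,
          docId, field and term are placeholders):

          TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
          if (tfv instanceof TermPositionVector) {
            TermPositionVector tpv = (TermPositionVector) tfv;
            int idx = tpv.indexOf("lucene");                        // a query term to locate
            if (idx >= 0) {
              int[] positions = tpv.getTermPositions(idx);          // token positions of the term
              TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx); // character offsets of the term
              // these arrays could back a TermPositions-style view over the single doc
            }
          }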

          Michael Busch added a comment -

          I think a nice [eventual] model would be if we could simply re-run the
          scorer on the single document (using InstantiatedIndex maybe, or
          simply some sort of wrapper on the term vectors which are already a
          mini-inverted-index for a single doc), but extend the scorer API to
          tell us the exact term occurrences that participated in a match (which
          I don't think is exposed today).

          But, if you have for example a document 'a b c a b c' and the query
          'a AND b', then this approach would only highlight the first two terms,
          no?

          Mark Miller added a comment -

          I think a nice [eventual] model would be if we could simply re-run the
          scorer on the single document (using InstantiatedIndex maybe, or
          simply some sort of wrapper on the term vectors which are already a
          mini-inverted-index for a single doc), but extend the scorer API to
          tell us the exact term occurrences that participated in a match (which
          I don't think is exposed today).

          Variations on this have been tossed around before, but this sounds like a slightly more interesting approach than what's been mentioned. It's sort of how the current highlighter handles positions, but avoids the messy step of trying to convert any query to a spanquery.

          Not sure it solves being able to get offsets from the query terms and still mask for positions though - if that step can be completed, we can start by using the current SpanScorer logic with this patch, until we get the pieces into core Lucene.

          Mark Miller added a comment -

          I don't think it's easy to get a speedy highlighter that works with positions for all of the Lucene queries. In the long term, I'd love to see a fast highlighter that works with positions for all of Lucene's queries. I'd also like it to work if you don't have termvectors stored (though perhaps be faster if they are, as it is now).

          Essentially we have each of these pieces separately now - the difficulty is doing it with one highlighter.

          We have the standard Highlighter with two modes: one that doesn't handle positions, and one that handles position-sensitive highlighting for Spans and almost all of the queries. This framework is great - it's customizable, it handles a lot of corner cases, it works without termvectors, and it gets faster with termvectors. Unfortunately, it runs through the source stream one token at a time, and doesn't scale well. Getting hit positions for position-sensitive clauses requires converting the query to a span query and calling getSpans on a memory index.

          We also have the termvector highlighters that can work from offsets in the query and avoid running through a token at a time. You need termvectors for this approach, and its difficult to handle positions, but it scales.

          The difficulty and goal is in merging the qualities of both.
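
          For illustration, the position-sensitive path above looks roughly like this (a
          minimal sketch only - the field name, text, analyzer and query term are
          placeholders, and the real code paths are more involved):

          // Rebuild a one-document index and ask a SpanQuery for its match positions.
          MemoryIndex mi = new MemoryIndex();
          mi.addField("content", "the quick brown fox", new WhitespaceAnalyzer());

          IndexReader oneDocReader = mi.createSearcher().getIndexReader();
          SpanQuery sq = new SpanTermQuery(new Term("content", "fox")); // stand-in for the converted query
          Spans spans = sq.getSpans(oneDocReader);
          while (spans.next()) {
            int startPos = spans.start(); // first matching token position
            int endPos = spans.end();     // one past the last matching position
            // map these token positions to character offsets and mark them for highlighting
          }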

          Koji Sekiguchi added a comment -

          Actually I was asking whether every fragment that's returned is
          guaranteed to show a match to my original query.

          EG if my query is a PhraseQuery, is it guaranteed that all fragments
          presented are valid matches? If I search for "Alan Greenspan's
          mortgage", is it ever possible to see a fragment that contains only
          "Alan Greenspan"?

          I see. Yes, H2 guarantees it.

          OK so eg *SpanQuery won't work? It seems like both highlighters take
          this "flatten" approach, which can lose the constraints for
          interesting queries (like Span, or a custom query).

          H2 doesn't support SpanQuery right now. I'll look at SpanScorer and LUCENE-1425 to see whether I can support "interesting queries" in H2, before going to the "eventual model" (which looks great) that you mentioned above.

          Michael McCandless added a comment -

          I'm not sure if I understand what you are asking, but if you are talking about the hl.requireFieldMatch feature in Solr, then yes, highlighter2 has the feature:

          Actually I was asking whether every fragment that's returned is
          guaranteed to show a match to my original query.

          EG if my query is a PhraseQuery, is it guaranteed that all fragments
          presented are valid matches? If I search for "Alan Greenspan's
          mortgage", is it ever possible to see a fragment that contains only
          "Alan Greenspan"?

          Currently, no. Highlighter2 calls the flatten() method to try to flatten the sourceQuery at the beginning. In the flatten() method, it recognizes TermQuery, PhraseQuery, and BooleanQuery instances that contain TermQuery and PhraseQuery:

          OK so eg *SpanQuery won't work? It seems like both highlighters take
          this "flatten" approach, which can lose the constraints for
          interesting queries (like Span, or a custom query).

          I think a nice [eventual] model would be if we could simply re-run the
          scorer on the single document (using InstantiatedIndex maybe, or
          simply some sort of wrapper on the term vectors which are already a
          mini-inverted-index for a single doc), but extend the scorer API to
          tell us the exact term occurrences that participated in a match (which
          I don't think is exposed today).

          EG ExactPhraseScorer.phraseFreq has the logic to check term positions
          and find all positions where the phrase matches. Right now that
          method throws away the specific position where each match occurred,
          but if instead we had it call a normally no-op method
          (recordDocMatchPosition(int position, float score) or some such), we
          could then make use of it for highlighting.
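
          Something along these lines, perhaps (just a sketch of the proposed hook, not
          actual Lucene code):

          // In the scorer (sketch): a normally no-op callback invoked once per in-document match.
          protected void recordDocMatchPosition(int position, float score) {
            // default: do nothing. A highlighting-aware subclass could collect
            // (position, score) pairs here and hand them to a fragment builder.
          }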

          Koji Sekiguchi added a comment -

          Mark,

          I'm guessing that's not an issue given it uses stored TermVectors rather than re-analyzing?

          Correct.

          At some stage I hope to take a closer look at this contribution.

          Very nice!

          I'd be interested to see if all the Highlighter1 Junit tests could be adapted to work with Highlighter2 and get some comparative benchmarks.

          I'm not sure all H1 test cases could be adapted to work with H2 because the fragment boundaries will be different between H1 and H2, but performance benchmarks are on my todo list.

          Koji Sekiguchi added a comment -

          Mike, I'm sorry for the late reply.

          Is this approach guaranteed to only highlight term occurrences that actually contribute to the document match?

          I'm not sure if I understand what you are asking, but if you are talking about the hl.requireFieldMatch feature in Solr, then yes, highlighter2 has the feature:

          /**
           * a constructor. A FragListBuilder and a FragmentsBuilder can be specified (plugins).
           * 
           * @param phraseHighlight true or false for phrase highlighting
           * @param fieldMatch true or false for field matching
           * @param fragListBuilder an instance of FragListBuilder
           * @param fragmentsBuilder an instance of FragmentsBuilder
           */
          public Highlighter( boolean phraseHighlight, boolean fieldMatch, FragListBuilder fragListBuilder, FragmentsBuilder fragmentsBuilder ){
            this.phraseHighlight = phraseHighlight;
            this.fieldMatch = fieldMatch;
            this.fragListBuilder = fragListBuilder;
            this.fragmentsBuilder = fragmentsBuilder;
          }
          

          Can it handle all / arbitrary Query subclasses?

          Currently, no. Highlighter2 calls the flatten() method to try to flatten the sourceQuery at the beginning. In the flatten() method, it recognizes TermQuery, PhraseQuery, and BooleanQuery instances that contain TermQuery and PhraseQuery:

          FieldQuery.java
          void flatten( Query sourceQuery, Collection<Query> flatQueries ){
            if( sourceQuery instanceof BooleanQuery ){
              BooleanQuery bq = (BooleanQuery)sourceQuery;
              for( BooleanClause clause : bq.getClauses() ){
                if( !clause.isProhibited() )
                  flatten( clause.getQuery(), flatQueries );
              }
            }
            else if( sourceQuery instanceof TermQuery ){
              if( !flatQueries.contains( sourceQuery ) )
                flatQueries.add( sourceQuery );
            }
            else if( sourceQuery instanceof PhraseQuery ){
              if( !flatQueries.contains( sourceQuery ) ){
                PhraseQuery pq = (PhraseQuery)sourceQuery;
                if( pq.getTerms().length > 1 )
                  flatQueries.add( pq );
                else if( pq.getTerms().length == 1 ){
                  flatQueries.add( new TermQuery( pq.getTerms()[0] ) );
                }
              }
            }
            // else discard queries
          }
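
          For example, a hand-built query like +f:a +f:"b c" -f:d would be flattened to
          just the TermQuery and the PhraseQuery, with the prohibited clause discarded
          (a worked example for illustration only; the field and terms are arbitrary):

          BooleanQuery bq = new BooleanQuery();
          bq.add(new TermQuery(new Term("f", "a")), BooleanClause.Occur.MUST);
          PhraseQuery pq = new PhraseQuery();
          pq.add(new Term("f", "b"));
          pq.add(new Term("f", "c"));
          bq.add(pq, BooleanClause.Occur.MUST);
          bq.add(new TermQuery(new Term("f", "d")), BooleanClause.Occur.MUST_NOT);

          Collection<Query> flat = new HashSet<Query>();
          // after flatten(bq, flat): flat holds { f:a, f:"b c" } -- the prohibited -f:d clause is dropped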
          

          But I'm always positive about supporting all / arbitrary Query subclasses in H2.

          How does it score fragments?

          Currently, H2 takes into account the query-time boost and tf within the fragment. For example, if we have q="a OR b^3" and two fragment candidates f1="a a a" and f2="a b", then f1 scores 3 and f2 scores 4, so getBestFragments() will return f2 first, then f1, when ScoreOrderFragmentsBuilder (the default) is used.
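
          Spelled out, the arithmetic for that example is (illustrative only, not the
          actual H2 scoring code):

          // q = "a OR b^3", fragment candidates f1 = "a a a" and f2 = "a b"
          float scoreF1 = 3 * 1.0f;            // three hits on 'a', each boost 1.0          -> 3.0
          float scoreF2 = 1 * 1.0f + 1 * 3.0f; // one 'a' (boost 1.0) + one 'b' (boost 3.0)  -> 4.0
          // so ScoreOrderFragmentsBuilder emits f2 before f1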

          Mark Harwood added a comment -

          I'm guessing that's not an issue given it uses stored TermVectors rather than re-analyzing?

          At some stage I hope to take a closer look at this contribution. I'd be interested to see if all the Highlighter1 Junit tests could be adapted to work with Highlighter2 and get some comparative benchmarks.

          Michael McCandless added a comment -

          Does this highlighter have a "max tokens to analyze" setting? Or does it always visit all terms in each document?

          Michael McCandless added a comment -

          Note that this issue depends on LUCENE-1448

          Whoops, right, I had skipped that step.

          Koji Sekiguchi added a comment -

          This highlighter looks very interesting! I love the colored tags, and
          the fast performance on large docs, and the extensive unit tests.

          Thank you for paying attention to this issue, Mike!

          When I applied the patch to current trunk, I see some tests failing,

          Note that this issue depends on LUCENE-1448, so apply LUCENE-1448.patch first, then apply LUCENE-1522.patch.

          # To apply LUCENE-1448.patch, check out revision 713975!!!
          $ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
          $ cd trunk
          $ patch -p0 < LUCENE-1448.patch
          $ patch -p0 < LUCENE-1522.patch
          

          I'll post a comment later on the rest of your questions.

          Michael McCandless added a comment -

          This highlighter looks very interesting! I love the colored tags, and
          the fast performance on large docs, and the extensive unit tests.

          When I applied the patch to current trunk, I see some tests failing,
          eg:

              [junit] Testcase: test1PhraseLongMVB(org.apache.lucene.search.highlight2.FieldPhraseListTest):	FAILED
              [junit] expected:<sppeeeed(1.0)((8[8,93]))> but was:<sppeeeed(1.0)((8[7,92]))>
              [junit] junit.framework.ComparisonFailure: expected:<sppeeeed(1.0)((8[8,93]))> but was:<sppeeeed(1.0)((8[7,92]))>
              [junit] 	at org.apache.lucene.search.highlight2.FieldPhraseListTest.test1PhraseLongMVB(FieldPhraseListTest.java:175)
          

          Is this approach guaranteed to only highlight term occurrences that
          actually contribute to the document match? Can it handle all /
          arbitrary Query subclasses? How does it score fragments?

          I also like that you first generate hits in the document, and from
          those hits you generate fragments (if I'm reading the code correctly);
          this is a nicely scalable approach.
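
          If I read the thread right, those two steps map onto the pluggable builders in the constructor quoted earlier: the FragListBuilder turns the per-document hit list into candidate fragments, and the FragmentsBuilder assembles and orders the final strings. A minimal wiring sketch (SimpleFragListBuilder is an assumed implementation name, not confirmed by the patch; ScoreOrderFragmentsBuilder is the default mentioned above):

          // Sketch only: wiring explicit plugins into the 4-arg constructor quoted earlier.
          // SimpleFragListBuilder is an assumed class name; adjust to whatever the patch provides.
          Highlighter h = new Highlighter(
              true,                              // phraseHighlight
              true,                              // fieldMatch (like hl.requireFieldMatch in Solr)
              new SimpleFragListBuilder(),       // hit list -> candidate fragments
              new ScoreOrderFragmentsBuilder()   // orders fragments by their score (boost and tf), per the thread above
          );
          // then use h.getFieldQuery(...) / h.getBestFragments(...) as usual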

          Koji Sekiguchi added a comment -

          The attached patch adds a "colored tag highlighting" feature.

          I provided the following colored tags:

          BaseFragmentsBuilder.java
          public static final String[] COLORED_PRE_TAGS = {
              "<b style=\"background:yellow\">", "<b style=\"background:lawngreen\">", "<b style=\"background:aquamarine\">",
              "<b style=\"background:magenta\">", "<b style=\"background:palegreen\">", "<b style=\"background:coral\">",
              "<b style=\"background:wheat\">", "<b style=\"background:khaki\">", "<b style=\"background:lime\">",
              "<b style=\"background:deepskyblue\">"
          };
          

          A sample picture will be attached shortly.
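
          For illustration only (hand-written HTML, not actual output from the patch), a query with two distinct terms might then yield a fragment along these lines, each term wrapped in its own colored tag:

          ... Lucene is a <b style="background:yellow">high-performance</b>, full-featured
          <b style="background:lawngreen">search</b> engine library written in Java ...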

          Koji Sekiguchi added a comment -

          To apply this patch, LUCENE-1448 also needs to be applied:

          $ svn co -r713975 http://svn.apache.org/repos/asf/lucene/java/trunk
          $ cd trunk
          $ patch -p0 < LUCENE-1448.patch
          $ patch -p0 < LUCENE-1522.patch
          

            People

            • Assignee:
              Mark Miller
              Reporter:
              Koji Sekiguchi
            • Votes:
              0
              Watchers:
              3
