Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      I created a lucene query highlighter (borrowing some code from the one in
      the sandbox) that my company is using. It better handles phrase queries,
      doesn't break HTML entities, and has the ability to either highlight terms
      in an entire document or to highlight fragments from the document. I would
      like to make it available to anyone who wants it.

        Activity

        Mark Miller added a comment -

        Some of this work moved into other issues, and some of it is just too old now. I think this issue has served its purpose.

        Mark Miller added a comment -

        Yeah - I would totally close this. This work has been superseded - and it looks like highlighting may be able to take another leap forward soon.

        Uwe Schindler added a comment -

        Mark Miller: What do you think, is this issue still relevant?

        If not, should we close it as resolved by FastVectorHighlighter, or by the recent improvements to the standard highlighter?

        Mark Miller added a comment -

        I would say we already have all of the functionality of this patch, and more. I have not checked how well it handles all of the corner cases, but it looks like Mark H did a bit of that. I would say it currently offers no functional value, though it may be faster than what we have for PhraseQuerys (it does not support Spans). The patch uses the offsets from the TokenStream for highlighting and just makes sure a PhraseQuery's terms are next to each other (I'm not sure how exactly this emulates slop), so it can be rather fast on larger docs.

        I analyzed all of the old Highlight code in JIRA when considering how best to do the SpanScorer, and passed on each for one reason or another. The main reasons for passing on this one were the lack of Span support, the loss of current highlighter features/API, and the pseudo-duplication of Lucene phrase query searching in the Highlighter code. I think a solution that doesn't duplicate Query code is much cleaner.

        So I don't think this is very useful with regard to the general Highlighter. The idea of using Token offset info to do the highlighting was also tried in Ronnie's JIRA issue (though in that case it was done through TermVectors and not from the TokenStream), and while it proves to be faster on large documents, it doesn't appear easy to retain the speed when working with Spans, and it doesn't fit well with the old API.

        Should we ditch the old API some day though, I have been playing around with this technique with my LargeDocHighlighter, and I still have hope that will go somewhere. I just don't see the old token scoring API being thrown away in the near future.
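
The offset-based approach described in this comment can be illustrated with a small, self-contained sketch. It is not the attached patch and does not use Lucene: the `OffsetPhraseHighlighter` class, its whitespace `tokenize` method (standing in for an analyzer's TokenStream), and the `<b>` markup are all assumptions made for illustration. It only shows the general idea of requiring phrase terms to sit in consecutive positions and then wrapping the text between the first token's start offset and the last token's end offset.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetPhraseHighlighter {

    /** A token's lowercased term text plus its character offsets in the original text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical whitespace tokenizer that records offsets (a stand-in for a TokenStream). */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                tokens.add(new Token(text.substring(start, i).toLowerCase(), start, i));
            }
        }
        return tokens;
    }

    /** Wraps each occurrence of the phrase (terms in consecutive positions, no slop) in <b> tags. */
    public static String highlightPhrase(String text, String... phrase) {
        if (phrase.length == 0) return text;
        List<Token> tokens = tokenize(text);
        StringBuilder out = new StringBuilder();
        int copied = 0; // characters before this offset are already in 'out'
        int i = 0;
        while (i + phrase.length <= tokens.size()) {
            boolean match = true;
            for (int k = 0; k < phrase.length; k++) {
                if (!tokens.get(i + k).term.equals(phrase[k])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                int start = tokens.get(i).start;
                int end = tokens.get(i + phrase.length - 1).end;
                out.append(text, copied, start)
                   .append("<b>").append(text, start, end).append("</b>");
                copied = end;
                i += phrase.length;
            } else {
                i++;
            }
        }
        out.append(text, copied, text.length());
        return out.toString();
    }
}
```

For example, `OffsetPhraseHighlighter.highlightPhrase("The quick brown fox", "quick", "brown")` wraps only the adjacent pair, while the same terms in reverse order in the text are left untouched. Because the original text is copied by offset rather than rebuilt from tokens, everything outside the match is preserved as-is.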

        Otis Gospodnetic added a comment -

        Mark M || Mark H:

        Do you think you could check out this three-year-old contribution? You did the most work around Highlighter and will be able to see if there is anything to be salvaged here or whether all functionality in this contribution already made it into contrib/highlighter. Thanks.

        Daniel Naber added a comment -

        fix title

        Sven Duzont added a comment -

        Created an attachment (id=15588)
        A patch to the JUnit test

        Sven Duzont added a comment -

        Created an attachment (id=15587)
        Added a fieldName in case a custom Analyzer is passed as an argument

        Mark Harwood added a comment -

        Created an attachment (id=15568)
        SpansExtractor

        Spans look like a reasonable way of defining the areas of interest in a doc.
        Here's a class that converts any query (term/phrase/spanNear..) into an array of
        Spans for use in highlighting.

        David Bohl added a comment -

        > Thanks! Is it possible to contribute it under the Apache Software License,
        > with the copyright statement pointing to Apache, like this?

        Yes, my company has given me permission to give this away. You can change it
        any way you want.

        David Bohl added a comment -

        Created an attachment (id=15563)
        Updated version

        I had already made an update to this to handle phrase queries with a slop set
        (we had a user report this as an error on our site). If there is a slop it
        just highlights individual terms in the phrase (and doesn't check if they are
        near each other).

        Paul Elschot added a comment -

        (In reply to comment #5)
        ...
        >
        > Do you know if it is possible to rewrite a Phrase query as a SpanQuery and
        > preserve all the behaviour eg slop factor? For the purposes of simplifying the
        > highlighter code it may be easier to rewrite PhraseQuerys to Spans and then call
        > getSpans as you suggest.

        I'd expect a one to one mapping of PhraseQuery to an ordered SpanNearQuery over
        SpanTermQueries, but I've never done this myself.

        Regards,
        Paul Elschot
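
To make the idea of an ordered "near" match with slop concrete, here is a simplified, self-contained model in plain Java. It is not Lucene code and makes no claim about how PhraseQuery or SpanNearQuery are implemented; the `OrderedNearMatcher` class and its `findSpans` method are hypothetical names. The assumption is that terms must occur in the given order and that slop counts the token positions inside the match that are not occupied by the matched terms.

```java
import java.util.ArrayList;
import java.util.List;

final class OrderedNearMatcher {

    /**
     * Returns {start, end} token-position pairs (end exclusive) for every start position
     * at which the terms occur in order with at most 'slop' unmatched positions in between.
     */
    public static List<int[]> findSpans(String[] tokens, String[] terms, int slop) {
        List<int[]> spans = new ArrayList<>();
        if (terms.length == 0) return spans;
        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(terms[0])) continue;
            int pos = start;
            boolean found = true;
            // Greedily take the earliest later occurrence of each following term;
            // for a fixed start this yields the shortest possible match.
            for (int k = 1; k < terms.length; k++) {
                int next = pos + 1;
                while (next < tokens.length && !tokens[next].equals(terms[k])) next++;
                if (next == tokens.length) {
                    found = false;
                    break;
                }
                pos = next;
            }
            // Positions inside the match not occupied by one of the matched terms.
            int gaps = pos - start - (terms.length - 1);
            if (found && gaps <= slop) {
                spans.add(new int[] {start, pos + 1});
            }
        }
        return spans;
    }
}
```

With the tokens of "the quick brown fox", the terms "quick" and "fox" do not match at slop 0 but match at slop 1, giving the span from position 1 up to (not including) position 4.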

        hoschek added a comment -

        Mark,

        Note that improvements I've made to the MemoryIndex haven't been committed in a while. There's a
        small bug fix for TermEnum and some performance and documentation improvements. One of them
        requires a Term.createTerm() addition, as outlined in the MemoryIndex bugzilla issue:
        http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

        No idea why these things have never been considered. So I'm shipping the source and binary in the Nux
        XQuery library. If you're interested you can get it from there.

        I'm ok with exposing a getReader method or similar. See the comments related to safety in the relevant
        methods. You've probably also seen the constructor that enables term offset indexing, currently private
        until the highlighter package matures.

        Wolfgang.

        Mark Harwood added a comment -

        (In reply to comment #4)
        > Given the possibility of nested span queries, it might be best
        > do this is by reindexing the field to be highlighted in ram, reuse
        > the span query on it for collecting the Spans (via getSpans())

        Nice. The MemoryIndex contribution would be a fast way of doing this. I've just
        adapted the LIA SpanQueryTest JUnit test to work with MemoryIndex and all seems
        well doing Spans against MemoryIndex. I had to expose a getReader() method on
        MemoryIndex to do this.

        Do you know if it is possible to rewrite a Phrase query as a SpanQuery and
        preserve all the behaviour eg slop factor? For the purposes of simplifying the
        highlighter code it may be easier to rewrite PhraseQuerys to Spans and then call
        getSpans as you suggest.

        Paul Elschot added a comment -

        (In reply to comment #3)
        ...
        > Going forward I think it would be useful to try retain some of the features of
        > the existing highlighter (eg IDF weighted fragment scoring, fragSizes defined
        > in bytes) and merge with your phrase-highlighting features. Adding span query
        > support would be good too. What I'm less clear on right now is how this is best
        > achieved.

        Given the possibility of nested span queries, it might be best
        to do this by reindexing the field to be highlighted in RAM, reusing
        the span query on it to collect the Spans (via getSpans()),
        and using the beginnings and the ends of these spans as
        the basis for highlighting.
        For efficiency during reindexing the analyzer used to assemble
        the lucene document could ignore all tokens that will not match,
        except for their positions.

        Regards,
        Paul Elschot
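
As a rough illustration of using span beginnings and ends as the basis for highlighting, the sketch below maps token-position spans back to character offsets and wraps the covered text. It is plain Java with no Lucene dependency: the `SpanHighlighter` class, the `{start, end}` array representation (end exclusive), the parallel offset arrays, and the `<b>` markup are assumptions for illustration, and the spans are taken as already collected by some other means.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SpanHighlighter {

    /**
     * Wraps the text covered by each span in <b> tags.
     *
     * @param text         the original field text
     * @param startOffsets character start offset of the token at each position
     * @param endOffsets   character end offset (exclusive) of the token at each position
     * @param spans        {startPosition, endPosition} pairs, end exclusive and greater than start
     */
    public static String highlight(String text, int[] startOffsets, int[] endOffsets,
                                   List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((int[] s) -> s[0]));
        StringBuilder out = new StringBuilder();
        int copied = 0; // characters before this offset are already in 'out'
        int i = 0;
        while (i < sorted.size()) {
            int from = sorted.get(i)[0];
            int to = sorted.get(i)[1];
            // Merge overlapping or nested spans so the markup never nests.
            while (i + 1 < sorted.size() && sorted.get(i + 1)[0] < to) {
                to = Math.max(to, sorted.get(i + 1)[1]);
                i++;
            }
            int startChar = startOffsets[from];
            int endChar = endOffsets[to - 1];
            out.append(text, copied, startChar)
               .append("<b>").append(text, startChar, endChar).append("</b>");
            copied = endChar;
            i++;
        }
        out.append(text, copied, text.length());
        return out.toString();
    }
}
```

Merging overlapping spans before emitting markup is one possible design choice for the nested-span case mentioned above; a real highlighter might instead score or rank the spans separately.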

        Mark Harwood added a comment -

        Created an attachment (id=15541)
        JUnit test

        Hi David,
        Thanks for this. I've still not taken the time to add proper phrase/span query
        support to the current sandbox highlighter so this will definitely be useful to
        folks.
        I've adapted portions of the existing highlighter's JUnit test to work on your
        highlighter code and have attached it here. There are some issues I've noted on
        a first pass over the code and these are illustrated in the JUnit tests. Some
        of them may be deliberate design choices (e.g. no slop factor support) but others
        I'd rate as real issues, e.g. lack of fieldname for use with analyzers.
        Going forward I think it would be useful to try to retain some of the features of
        the existing highlighter (e.g. IDF weighted fragment scoring, fragSizes defined
        in bytes) and merge with your phrase-highlighting features. Adding span query
        support would be good too. What I'm less clear on right now is how this is best
        achieved.

        Thanks again,
        Mark

        Daniel Naber added a comment -

        Thanks! Is it possible to contribute it under the Apache Software License,
        with the copyright statement pointing to Apache, like this?

        /**
         * Copyright 2005 The Apache Software Foundation
         *
         * Licensed under the Apache License, Version 2.0 (the "License");
         * you may not use this file except in compliance with the License.
         * You may obtain a copy of the License at
         *
         *     http://www.apache.org/licenses/LICENSE-2.0
         *
         * Unless required by applicable law or agreed to in writing, software
         * distributed under the License is distributed on an "AS IS" BASIS,
         * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
         * See the License for the specific language governing permissions and
         * limitations under the License.
         */

        This would make it easier to integrate it in our SVN repository.

        David Bohl added a comment -

        Created an attachment (id=15538)
        This is the code for the query highlighter.


          People

          • Assignee:
            Mark Miller
            Reporter:
            David Bohl
          • Votes:
            0
            Watchers:
            3
