Key: LUCENE-403
Type: Improvement
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: David Bohl
Votes: 0
Watchers: 3
Lucene - Java

Alternate Lucene Query Highlighter

Created: 28/Jun/05 02:30 AM   Updated: 25/May/08 12:07 PM
Component/s: Other
Affects Version/s: 1.4
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
  Size
Java Source File HighlighterTest.java 2005-07-05 12:35 AM Sven Duzont 57 kB
Java Source File HighlighterTest.java 2005-06-28 07:46 AM Mark Harwood 10 kB
Java Source File QueryHighlighter.java 2005-07-05 12:34 AM Sven Duzont 36 kB
Java Source File QueryHighlighter.java 2005-07-01 12:50 AM David Bohl 15 kB
Java Source File QueryHighlighter.java 2005-06-28 02:32 AM David Bohl 15 kB
Java Source File QuerySpansExtractor.java 2005-07-01 05:30 AM Mark Harwood 3 kB
Environment:
Operating System: All
Platform: All

Bugzilla Id: 35518


Description
I created a lucene query highlighter (borrowing some code from the one in
the sandbox) that my company is using. It better handles phrase queries,
doesn't break HTML entities, and has the ability to either highlight terms
in an entire document or to highlight fragments from the document. I would
like to make it available to anyone who wants it.

Comments
David Bohl added a comment - 28/Jun/05 02:32 AM
Created an attachment (id=15538)
This is the code for the query highlighter.

Daniel Naber added a comment - 28/Jun/05 03:56 AM
Thanks! Is it possible to contribute it under the Apache Software License,
with the copyright statement pointing to Apache, like this?

/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

This would make it easier to integrate it in our SVN repository.


Mark Harwood added a comment - 28/Jun/05 07:46 AM
Created an attachment (id=15541)
JUnit test

Hi David,
Thanks for this. I've still not taken the time to add proper phrase/span query
support to the current sandbox highlighter so this will definitely be useful to
folks.
I've adapted portions of the existing highlighter's JUnit test to work on your
highlighter code and have attached it here. There are some issues I've noted on
a first pass over the code, and these are illustrated in the JUnit tests. Some
of them may be deliberate design choices (e.g. no slop factor support), but others
I'd rate as real issues, e.g. the lack of a field name for use with analyzers.
Going forward, I think it would be useful to try to retain some of the features of
the existing highlighter (e.g. IDF-weighted fragment scoring, fragSizes defined
in bytes) and merge them with your phrase-highlighting features. Adding span query
support would be good too. What I'm less clear on right now is how this is best
achieved.

Thanks again,
Mark


Paul Elschot added a comment - 28/Jun/05 04:20 PM
(In reply to comment #3)
...
> Going forward I think it would be useful to try retain some of the features of
> the existing highlighter (eg IDF weighted fragment scoring, fragSizes defined
> in bytes) and merge with your phrase-highlighting features.Adding span query
> support would be good too. What I'm less clear on right now is how this is best
> achieved.

Given the possibility of nested span queries, it might be best
to do this by reindexing the field to be highlighted in RAM, reusing
the span query on it to collect the Spans (via getSpans()),
and using the beginnings and ends of these spans as
the basis for highlighting.
For efficiency during reindexing, the analyzer used to assemble
the Lucene document could ignore all tokens that will not match,
except for their positions.

Regards,
Paul Elschot
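
A minimal sketch of the last step in this proposal, turning span positions into character ranges that a highlighter could mark up. It uses plain Java only and is not the code attached to this issue. The class SpanOffsetMapper, its Token type, and the {start, end} pair representation are hypothetical names introduced for illustration. It assumes each token's list index is its position and that a span's end position is exclusive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps span position ranges to character ranges. */
final class SpanOffsetMapper {

    /** A token with its character offsets; its index in the token list is its position. */
    static final class Token {
        final String text;
        final int startOffset; // inclusive
        final int endOffset;   // exclusive

        Token(String text, int startOffset, int endOffset) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Converts spans given as {startPosition, endPosition} pairs (end exclusive,
     * an assumption of this sketch) into {startOffset, endOffset} character ranges.
     */
    static List<int[]> toCharRanges(List<Token> tokens, List<int[]> spans) {
        List<int[]> ranges = new ArrayList<>();
        for (int[] span : spans) {
            int first = span[0];
            int last = span[1] - 1; // last position covered by the span
            if (first < 0 || last >= tokens.size() || first > last) {
                continue; // skip spans that do not fit the token list
            }
            ranges.add(new int[] {tokens.get(first).startOffset, tokens.get(last).endOffset});
        }
        return ranges;
    }
}
```

For the text "the quick brown fox", a span covering positions 1 to 3 (exclusive) would map to the character range 4 to 15, i.e. "quick brown".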


Mark Harwood added a comment - 28/Jun/05 07:26 PM
(In reply to comment #4)
> Given the possibility of nested span queries, it might be best
> do this is by reindexing the field to be highlighted in ram, reuse
> the span query on it for collecting the Spans (via getSpans())

Nice. The MemoryIndex contribution would be a fast way of doing this. I've just
adapted the LIA SpanQueryTest JUnit test to work with MemoryIndex and all seems
well doing Spans against MemoryIndex. I had to expose a getReader() method on
MemoryIndex to do this.

Do you know if it is possible to rewrite a Phrase query as a SpanQuery and
preserve all the behaviour eg slop factor? For the purposes of simplifying the
highlighter code it may be easier to rewrite PhraseQuerys to Spans and then call
getSpans as you suggest.


hoschek added a comment - 29/Jun/05 12:21 AM
Mark,

Note that improvements I've made to the MemoryIndex haven't been committed in a while. There's a
small bug fix for TermEnum and some performance and documentation improvements. One of them
requires a Term.createTerm() addition, as outlined in the MemoryIndex Bugzilla issue:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

No idea why these things have never been considered. So I'm shipping the source and binary in the Nux
XQuery library. If you're interested you can get it from there.

I'm ok with exposing a getReader method or similar. See the comments related to safety in the relevant
methods. You've probably also seen the constructor that enables term offset indexing, currently private
until the highlighter package matures.

Wolfgang.


Paul Elschot added a comment - 29/Jun/05 04:03 AM
(In reply to comment #5)
...
>
> Do you know if it is possible to rewrite a Phrase query as a SpanQuery and
> preserve all the behaviour eg slop factor? For the purposes of simplifying the
> highlighter code it may be easier to rewrite PhraseQuerys to Spans and then call
> getSpans as you suggest.

I'd expect a one to one mapping of PhraseQuery to an ordered SpanNearQuery over
SpanTermQueries, but I've never done this myself.

Regards,
Paul Elschot


David Bohl added a comment - 01/Jul/05 12:50 AM
Created an attachment (id=15563)
Updated version

I had already made an update to this to handle phrase queries with a slop set
(we had a user report this as an error on our site). If there is a slop it
just highlights individual terms in the phrase (and doesn't check if they are
near each other).
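
For comparison, a proximity check that does respect slop could look roughly like the sketch below. This is a simplified in-order check written in plain Java, not the attached QueryHighlighter code and not Lucene's actual sloppy phrase matching. The class SloppyPhraseMatcher is a hypothetical name, and slop is interpreted here as the number of positions skipped between the matched terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: in-order phrase matching over token positions with a slop limit. */
final class SloppyPhraseMatcher {

    /**
     * Returns matches as {startPosition, endPosition} pairs (end exclusive).
     * A token's array index is treated as its position.
     */
    static List<int[]> findMatches(String[] tokens, String[] phrase, int slop) {
        List<int[]> matches = new ArrayList<>();
        if (phrase.length == 0) {
            return matches;
        }
        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(phrase[0])) {
                continue;
            }
            int prev = start;
            boolean found = true;
            // Greedily take the earliest later occurrence of each remaining term.
            for (int i = 1; i < phrase.length && found; i++) {
                int next = indexOf(tokens, phrase[i], prev + 1);
                if (next < 0) {
                    found = false;
                } else {
                    prev = next;
                }
            }
            // Number of positions skipped between the matched terms.
            int gaps = (prev - start) - (phrase.length - 1);
            if (found && gaps <= slop) {
                matches.add(new int[] {start, prev + 1});
            }
        }
        return matches;
    }

    /** Returns the first index at or after 'from' holding 'term', or -1 if there is none. */
    private static int indexOf(String[] tokens, String term, int from) {
        for (int i = from; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                return i;
            }
        }
        return -1;
    }
}
```

With slop 0 this only accepts terms at consecutive positions, which corresponds to the exact-phrase case. Unlike a sloppy PhraseQuery, it never accepts reordered terms.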


David Bohl added a comment - 01/Jul/05 12:58 AM
> Thanks! Is it possible to contribute it under the Apache Software License,
> with the copyright statement pointing to Apache, like this?

Yes, my company has given me permission to give this away. You can change it
any way you want.


Mark Harwood added a comment - 01/Jul/05 05:30 AM
Created an attachment (id=15568)
SpansExtractor

Spans look like a reasonable way of defining the areas of interest in a doc.
Here's a class that converts any query (term/phrase/spanNear..) into an array of
Spans for use in highlighting.


Sven Duzont added a comment - 05/Jul/05 12:34 AM
Created an attachment (id=15587)
Added a fieldName in case a custom Analyzer is passed as an argument

Sven Duzont added a comment - 05/Jul/05 12:35 AM
Created an attachment (id=15588)
A patch to the JUnit test

Daniel Naber added a comment - 19/May/07 11:31 AM
fix title

Otis Gospodnetic added a comment - 17/May/08 01:53 AM
Mark M || Mark H:

Do you think you could check out this three-year-old contribution? You did the most work around Highlighter and will be able to see if there is anything to be salvaged here or whether all the functionality in this contribution has already made it into contrib/highlighter. Thanks.


Mark Miller added a comment - 25/May/08 12:07 PM
I would say we already have all of the functionality of this patch, and more. I have not checked how well this handles all of the corner cases, but it looks like Mark H did a bit of that. I would say it currently offers no functional value, though it may be faster than what we have for PhraseQuerys (it does not support Spans). The patch uses the offsets from the TokenStream for highlighting and just makes sure a PhraseQuery's terms are next to each other (not sure how exactly this emulates slop), so this can be rather fast on larger docs.

I analyzed all of the old Highlighter code in JIRA when considering how best to do the SpanScorer, and passed on each of them for one reason or another. The main reasons for passing on this one were the lack of Span support, the loss of current highlighter features/API, and the pseudo-duplication of Lucene phrase query searching in the Highlighter code. I think a solution that doesn't duplicate Query code is much cleaner.

So I don't think this is very useful in regards to the general Highlighter. The idea of using Token offset info to do the Highlighting was also tried in Ronnie's JIRA issue (though in that case it was done through TermVectors and not from the TokenStream), and while it proves to be faster on large documents, it doesn't appear easy to retain the speed when working with Spans, and it doesn't fit well with the old API.

Should we ditch the old API some day though, I have been playing around with this technique with my LargeDocHighlighter, and I still have hope that will go somewhere. I just don't see the old token scoring API being thrown away in the near future.
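
To make the offset-based technique discussed above concrete, here is a minimal plain-Java sketch of the markup step: once matching tokens are known, their character offsets are enough to wrap the hits in tags without re-tokenizing the text. The class OffsetMarkup and its signature are hypothetical; this is not the API of the attached patch, contrib/highlighter, or LargeDocHighlighter. It assumes every range is a valid {start, end} pair (end exclusive) inside the text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: wraps character ranges of a text in highlight tags. */
final class OffsetMarkup {

    /**
     * Wraps each {start, end} range (end exclusive) in the given tags.
     * Overlapping or touching ranges are merged so tags never nest.
     */
    static String highlight(String text, List<int[]> ranges, String open, String close) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        StringBuilder out = new StringBuilder();
        int cursor = 0; // first character not yet copied to the output
        int i = 0;
        while (i < sorted.size()) {
            int start = sorted.get(i)[0];
            int end = sorted.get(i)[1];
            i++;
            // Absorb every following range that overlaps or touches the current one.
            while (i < sorted.size() && sorted.get(i)[0] <= end) {
                end = Math.max(end, sorted.get(i)[1]);
                i++;
            }
            out.append(text, cursor, start)
               .append(open)
               .append(text, start, end)
               .append(close);
            cursor = end;
        }
        out.append(text, cursor, text.length());
        return out.toString();
    }
}
```

Because the work here is proportional to the number of hits plus one pass over the text, a step like this stays cheap on large documents; the expensive part in practice is deciding which offsets qualify.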