Thanks! Is it possible to contribute it under the Apache Software License,
with the copyright statement pointing to Apache, like this? /**
This would make it easier to integrate it in our SVN repository.

Created an attachment (id=15541)
JUnit test

Hi David, Thanks again,

(In reply to comment #3)
...
> Going forward I think it would be useful to try retain some of the features of
> the existing highlighter (eg IDF weighted fragment scoring, fragSizes defined
> in bytes) and merge with your phrase-highlighting features. Adding span query
> support would be good too. What I'm less clear on right now is how this is best
> achieved.

Given the possibility of nested span queries, it might be best

Regards,

(In reply to comment #4)
> Given the possibility of nested span queries, it might be best
> do this is by reindexing the field to be highlighted in ram, reuse
> the span query on it for collecting the Spans (via getSpans())

Nice. The MemoryIndex contribution would be a fast way of doing this. I've just

Do you know if it is possible to rewrite a Phrase query as a SpanQuery and

Mark,
Note that improvements I've made to the MemoryIndex haven't been committed in a while. There's a

No idea why these things have never been considered. So I'm shipping the source and binary in the Nux

I'm ok with exposing a getReader method or similar. See the comments related to safety in the relevant

Wolfgang.

(In reply to comment #5)
...
>
> Do you know if it is possible to rewrite a Phrase query as a SpanQuery and
> preserve all the behaviour eg slop factor? For the purposes of simplifying the
> highlighter code it may be easier to rewrite PhraseQuerys to Spans and then call
> getSpans as you suggest.

I'd expect a one to one mapping of PhraseQuery to an ordered SpanNearQuery over

Regards,

Created an attachment (id=15563)
Updated version

I had already made an update to this to handle phrase queries with a slop set

> Thanks! Is it possible to contribute it under the Apache Software License,
> with the copyright statement pointing to Apache, like this?

Yes, my company has given me permission to give this away. You can change it

Created an attachment (id=15568)
SpansExtractor

Spans looks like a reasonable way of defining the areas of interest in a doc.

Created an attachment (id=15587)
Added a fieldName in case a custom Analyser is passed in arguments

Created an attachment (id=15588)
A patch to the JUnit test

Mark M || Mark H:
Do you think you could check out this 3

I would say we do have all of the functionality of this patch and more. I have not checked how well this handles all of the corner cases, but it looks like Mark H did a bit of that. I would say it currently offers no functional value, though it may be faster than what we have for PhraseQuerys (it does not support Spans). The patch uses the offsets from the TokenStream for highlighting and just makes sure a PhraseQuery's terms are next to each other (not sure how exactly this emulates slop), so this can be rather fast on larger docs.

I analyzed all of the old Highlight code in JIRA when considering how best to do the SpanScorer, and passed on each of them for one reason or another. The main reasons for passing on this one were the lack of Span support, the loss of current highlighter features/API, and the pseudo-duplication of Lucene phrase query searching in the Highlighter code. I think a solution that doesn't duplicate Query code is much cleaner, so I don't think this is very useful with regard to the general Highlighter.

The idea of using Token offset info to do the highlighting was also tried in Ronnie's JIRA issue (though in that case it was done through TermVectors and not from the TokenStream). While it proves to be faster on large documents, it doesn't appear easy to retain the speed when working with Spans, and it doesn't fit well with the old API. Should we ditch the old API some day, though, I have been playing around with this technique in my LargeDocHighlighter, and I still have hope that it will go somewhere. I just don't see the old token scoring API being thrown away in the near future.
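
For readers following along, here is a minimal sketch of the general technique described above: take token offsets and only highlight where the phrase terms sit next to each other. It is not the attached patch and uses no Lucene classes. The `Token` record, the whitespace tokenizer, and the names `tokenize`, `findExactPhrase` and `highlight` are all hypothetical stand-ins, and it assumes exact adjacency only (no slop, no Spans).

```java
import java.util.ArrayList;
import java.util.List;

class PhraseOffsetSketch {

    /** Hypothetical token: lowercased term text plus character offsets into the source text. */
    record Token(String term, int startOffset, int endOffset) {}

    /** Naive whitespace tokenizer, standing in for a real analyzer (assumption). */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace, then consume one run of non-whitespace characters.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                tokens.add(new Token(text.substring(start, i).toLowerCase(), start, i));
            }
        }
        return tokens;
    }

    /** Returns non-overlapping {start, end} character ranges where the phrase terms are adjacent. */
    static List<int[]> findExactPhrase(List<Token> tokens, List<String> phrase) {
        List<int[]> hits = new ArrayList<>();
        if (phrase.isEmpty()) return hits;
        int i = 0;
        while (i + phrase.size() <= tokens.size()) {
            boolean match = true;
            for (int j = 0; j < phrase.size(); j++) {
                if (!tokens.get(i + j).term().equals(phrase.get(j))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                Token first = tokens.get(i);
                Token last = tokens.get(i + phrase.size() - 1);
                hits.add(new int[] { first.startOffset(), last.endOffset() });
                i += phrase.size(); // jump past the match so hits never overlap
            } else {
                i++;
            }
        }
        return hits;
    }

    /** Wraps each hit in <B> tags, inserting from the end so earlier offsets stay valid. */
    static String highlight(String text, List<int[]> hits) {
        StringBuilder sb = new StringBuilder(text);
        for (int k = hits.size() - 1; k >= 0; k--) {
            sb.insert(hits.get(k)[1], "</B>");
            sb.insert(hits.get(k)[0], "<B>");
        }
        return sb.toString();
    }
}
```

With the text "The quick brown fox met a quick red fox" and the phrase terms "quick", "brown", only the first "quick" is highlighted, because the second one is not followed by "brown". Because the match is a single pass over the token list, this kind of approach stays cheap on large documents, but it re-implements phrase matching outside the query classes, which is the duplication concern raised above.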
This is the code for the query highlighter.