Originally, I had planned to create a contrib module to allow people to highlight multiple documents in parallel, but after talking to Uwe in IRC about it, I realized that it was pretty useless. However, I was already sitting on an iterative highlighting algorithm that was much faster (my tests show 20% - 40%) and more accurate and, based on that same IRC conversation, I decided to not let all the work that I had done go to waste and try to contribute it back again. Uwe had mentioned that "More like this" detected term vectors when called and use the term vector implementation when possible, if I recall correctly, so I decided to do that.
The patch that I've attached is my first stab at this. It's not nearly complete and full disclosure dictates that I say that it's not fully documented and there are not any unit tests written. I wanted to go ahead and open an issue to get some feedback on the approach that I've taken as well as the fact that it exists will be a proverbial kick in my pants to continue working on it.
In short, what I've changed:
- Completely rewritten the non-tv highlighter to be faster and cleaner. There is some small loss in functionality for now, namely the loss of the GradientHighlighter (I just haven't done this yet) and the lack of exposure of TermFragments and their scores (I can expose this if it is deemed necessary, this is one of the things I'd like feedback on).
- Moved org.apache.lucene.search.vectorhighlight and org.apache.lucene.search.highlight to a single package with a unified interface, search.highlight (with two sub-packages: search.highlight.termvector and search.highlight.iterative, respectively).
- Unified the highlighted term formatting into a single interface: highlighter/Formatter and both highlighters use this now.
What I need to do before I personally would consider this finished:
- Finish documentation, most specifically on TermVectorHighlighter. I haven't done this now as I expect things to change up quite a bit before they're finalized and I really hate writing documentation that goes to waste, but I do intend to complete this bullet
- "Flesh out" the API of search.highlight.Highlighter as it's very barebones right now
- Continue removing and consolidating duplicate functionality, like I've done with the highlighted word tag generation.
What I think I need feedback on, before I can proceed:
- FastTermVectorHighlighter and the iterative highlighters need completely different sets of information in order to work. The approach I've taken is exposing a vectorHighlight method in the unified interface and a iterativeHighlight method, as well as a single highlight method that takes all the information needed for either of them and I'm unsure if this is the best way to do this.
- The naming of things; I'm not sure if this is a big issue, or even an issue at all, but I'd like to not break any conventions that may exist that I'm unaware of.
- How big of a deal is exposing the particular score of a segment from the highlighting interface and does this need to be extended into the term vector highlighting as well?
- There are a lot of methods in the tv implementation that are marked depracted; since this release will almost definitely break backwards compatibility anyway, are these safe to remove?
- Any other input anyone else may have
I'm going to continue to work on things that I can work on, at least unless someone tells me I'm wasting my time and will look forward to hearing you guys' feedback!
As a sidenote because it does seem rather random that I would arbitrarily re-write a working algorithm in the non-tv highlighter, I did it originally because I wanted to parallelize the highlighting (which was a failed experiment) and simply to see if I could make the algorithm faster, as I find that sort of thing particularly fun
As a second sidenote, if anyone would like an explanation of the algorithm for the highlighting I devised, and why I feel that it's more accurate, I'd be happy to provide them with one (and benchmarks as well).