> Do you require term vectors to be stored, for highlighting (cannot
> re-analyze the text)?
Yes, but that's not fundamental to the design. You just have to hand the
Weight some sort of single-doc index that includes sufficient data to
determine what parts of the text contributed to the hit and how much they
contributed. The Weight needn't care whether that single-doc index was
created on the fly or stored at index time.
> For queries that normally do not use positions at all (simple AND/OR
> of terms), how does your highlightSpans() work?
ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
spans produced by their children.
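
As a rough illustration of "union of the children's spans", here is a minimal Python sketch. It is not the KS code: the Span tuple, leaf_spans() and compound_spans() are hypothetical names made up for this example, and the leaf just does a naive substring search instead of consulting term vectors.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical span record; not the KS data structure."""
    offset: int    # character offset where the span starts
    length: int    # span length in characters
    weight: float  # how much this span contributed to the hit


def leaf_spans(term: str, text: str, weight: float) -> List[Span]:
    """Hypothetical leaf query: one span per occurrence of `term`."""
    spans = []
    start = text.find(term)
    while start != -1:
        spans.append(Span(start, len(term), weight))
        start = text.find(term, start + 1)
    return spans


def compound_spans(children: List[List[Span]]) -> List[Span]:
    """Union of the children's spans: pool them, drop exact
    duplicates, and return them in document order."""
    pooled = {span for spans in children for span in spans}
    return sorted(pooled)
```

The compound query does no position matching of its own in this sketch; it simply passes along whatever its children found.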
> For BooleanQuery, is coord factor used to favor fragment sets that
> include more unique terms?
No; I don't think that would be fine-grained enough to help.
There's a HeatMap class that performs additional weighting. Spans that
cluster together tightly (i.e. that could fit together within the excerpt) are
> Are you guaranteed to always present a net set of fragments that
> "matches" the query? (eg the example query above).
No. The KS version supplies a single fragment. It naturally prefers
fragments with rarer terms, because the span scores are multiplied by the
Weight's weighting factor (which includes IDF).
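
To make the "single fragment, biased toward rare terms" idea concrete, here is a simplified Python sketch. It assumes each span already carries a score that has been multiplied by the Weight's factor, and it just sums the scores of spans that fit inside a fixed-width window. best_window() and its parameters are hypothetical, and this is a stand-in for illustration, not the actual HeatMap logic.

```python
from typing import List, Tuple

# (offset, length, score); score is assumed to be pre-multiplied
# by the Weight's factor, so rarer terms carry larger scores.
ScoredSpan = Tuple[int, int, float]


def best_window(spans: List[ScoredSpan], window: int) -> Tuple[int, float]:
    """Return (start, total) for the window of `window` characters
    whose fully contained spans have the highest summed score.
    Candidate windows start at each span's offset."""
    ordered = sorted(spans)
    best_start, best_total = 0, 0.0
    for i, (start, _, _) in enumerate(ordered):
        limit = start + window
        # Only spans that fit entirely inside the window count.
        total = sum(score for off, length, score in ordered[i:]
                    if off + length <= limit)
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total
```

With spans such as [(0, 6, 1.0), (300, 4, 3.0), (320, 5, 3.0), (900, 6, 1.0)] and a 200-character window, the window starting at 300 wins because it holds the two high-scoring spans.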
Once that fragment is selected, the KS highlighter worries a lot about
trimming to sensible sentence boundaries.
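
A very crude Python sketch of the boundary-trimming step, under loud assumptions: sentence starts are approximated with a regex on ". ", "! " and "? ", and the start of the fragment is only moved backwards by a limited amount. snap_to_sentence_start() and max_shift are hypothetical; the real KS logic is more involved than this.

```python
import re


def snap_to_sentence_start(text: str, start: int, max_shift: int = 50) -> int:
    """Move `start` back to the nearest preceding sentence start,
    provided it is no more than `max_shift` characters away.
    Otherwise return `start` unchanged."""
    # Position 0 plus every position just after sentence-ending
    # punctuation and the whitespace that follows it.
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", text)]
    candidates = [b for b in boundaries if start - max_shift <= b <= start]
    return max(candidates) if candidates else start
```

If no boundary is close enough, the fragment has to start mid-sentence, which is where an ellipsis would normally go.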
In my own subjective judgment, supplying a single maximally coherent fragment
which prefers clusters of rare terms results in an excerpt which "scans" as
quickly as possible, conveying the gist of the content with minimal "visual
effort". I used Google's excerpting as a model.
> I think the base litmus test for a highlighter is: if one were to
> take all fragments presented for a document (call this a "fragdoc")
> and make a new document from it, would that document match the
> original query?
Without the aid of formal studies to guide us, this is a subjective call.
FWIW, I disagree. In my view, visual scanning speed and coherence
are more important than completeness.
I'm not a big fan of the multi-fragment approach, because I think it takes too
much effort to grok each individual entry. Furthermore, the fact that the
fragments don't start on sentence boundaries (whenever feasible) adds to the
visual effort needed to orient yourself.
Search results contain a lot of junk. The user needs to be able to parse the
results page as quickly as possible and refine their search query as needed.
Noisy excerpts, with lots of ellipses and few sentences that can be "swallowed
whole", impede that. Trees vs. Forest.
Again, that's my own aesthetic judgment, but I'll wager that there are studies
out there showing that fragments which start at the top of a sentence are
easier to consume, and I think that's important.
> In fact, I think the perfect highlighter would "logically" work as
> follows: take a single document and enumerate every single possible
KS uses a sliding window rather than chunking up the text into fragdocs of
> Having a whole separate package trying to reverse-engineer where matches had
> taken place between Query and Document is hard to get right.
PS: Obviously, refinements of the highlighting algo will help Lucy, too. I
don't suppose you want to continue this on the Lucy dev list so that Lucy
banks some community credit for this discussion. :\