[SOLR-1910] Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path from search - ASF JIRA

Details

Type: Improvement
Status: Reopened
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.4
Fix Version/s: None
Component/s: highlighter
Labels: None

Description

Summary: Patch adds a hl.df parameter, to help with (some) situations where the highlighter currently uses the "wrong" analyzer for highlighting.

What: hl.df is like the normal df parameter, except that it takes effect only during highlighting. (In fact the implementation is basically to temporarily mess with the normal df parameter at the start of highlighting, and then revert to the original value when highlighting is complete.) When hl.df is specified, we make sure not to use the Query object that was parsed by QueryComponent, but rather make our own. In the right circumstances anyway, this means that a more appropriate analyzer gets used for highlighting.
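The override-and-restore approach described above could be sketched roughly as follows. This is a Python illustration, not the Java code from the patch: `override_default_field`, `parse_query`, and `highlight_query` are hypothetical names, and the tuple returned by `parse_query` merely stands in for a parsed Query object.

```python
from contextlib import contextmanager


@contextmanager
def override_default_field(params, hl_df):
    """Temporarily set params['df'] to hl_df, restoring the old value on exit."""
    missing = object()
    original = params.get("df", missing)
    params["df"] = hl_df
    try:
        yield params
    finally:
        # Revert to the original df even if highlighting raised an error.
        if original is missing:
            del params["df"]
        else:
            params["df"] = original


def parse_query(query_text, params):
    # Hypothetical stand-in for a query parser: it records which default
    # field (and therefore which analyzer) was in effect at parse time.
    return (params["df"], query_text)


def highlight_query(query_text, params, search_query):
    """Pick the query to highlight with: re-parse only when hl.df is given."""
    hl_df = params.get("hl.df")
    if hl_df is None:
        # No hl.df: reuse the query already parsed for search.
        return search_query
    with override_default_field(params, hl_df):
        return parse_query(query_text, params)


params = {"df": "body", "hl.df": "highlight"}
search_query = parse_query('"audit trail"', params)
hl_query = highlight_query('"audit trail"', params, search_query)
```

In this sketch the search query is parsed against "body", the highlighting query against "highlight", and `params["df"]` is back to "body" once highlighting finishes.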

Motivation: Currently, in a normal query+highlighting request, the highlighter re-uses the Query object parsed by the QueryComponent. This can result in incorrect highlights if the field being highlighted is of a different type than the field being queried. In my particular case:

- My queries don't explicitly specify field names; they always rely on the default field.
- My default field for search is "body".
- body is a unigram-plus-bigram field. So, e.g. input "audit trail" gets turned into tokens "audit / audit trail / trail". (This is a performance optimization.)
- If I try to highlight directly on "body", the highlights get screwed up. (This is because the highlighter doesn't really support the kind of "continuously overlapping" tokens generated by my analysis chain. In short, the bigrams confuse the TokenGroup class.)
- To avoid these highlighting problems, I don't directly highlight "body", but rather a "highlight" field, which has no bigram tokens. ("highlight" is populated from "body" with a copyField directive.)
- Without hl.df, I have a new class of highlighting problems. In particular, if the user enters a phrase search (e.g. "audit trail"), then that phrase appears unhighlighted in the highlighter output. The short version of why is that the analyzer used to parse the query outputs a Query object that contains bigrams, but the text that we're highlighting doesn't contain bigrams.
- With hl.df, the analyzers match up for highlighting; the Query object used for highlighting does not contain bigrams, just like the "highlight" field.
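To make the mismatch concrete, here is a small Python sketch of the two analysis paths. It is a toy model, not Solr's analysis chain: `unigrams`, `unigrams_plus_bigrams`, and `contains_phrase` are hypothetical helpers, whitespace splitting stands in for the real tokenizer, and phrase matching is simplified to a contiguous token-sequence check.

```python
def unigrams(text):
    """Toy analyzer for the "highlight" field: lowercase, split on whitespace."""
    return text.lower().split()


def unigrams_plus_bigrams(text):
    """Toy analyzer for the "body" field: each word, then the bigram it starts."""
    words = unigrams(text)
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        if i + 1 < len(words):
            tokens.append(word + " " + words[i + 1])
    return tokens


def contains_phrase(field_tokens, query_tokens):
    """Simplified phrase match: query tokens must appear as a contiguous run."""
    n = len(query_tokens)
    return any(
        field_tokens[i:i + n] == query_tokens
        for i in range(len(field_tokens) - n + 1)
    )


# Text as stored in the "highlight" field (no bigram tokens).
highlight_tokens = unigrams("The audit trail was reviewed")

# Query parsed with the "body" analyzer (what happens without hl.df).
query_via_body = unigrams_plus_bigrams("audit trail")

# Query parsed with the "highlight" analyzer (what hl.df=highlight allows).
query_via_highlight = unigrams("audit trail")

matched_without_hl_df = contains_phrase(highlight_tokens, query_via_body)
matched_with_hl_df = contains_phrase(highlight_tokens, query_via_highlight)
```

Here `query_via_body` carries the "audit trail" bigram, which never occurs among the tokens of the "highlight" field, so only the query parsed with the highlight-side analyzer finds the phrase.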

(I realize it may help to expand the description of this use case, but I'm a bit hurried right now.)

I wanted to throw this out there, partly in case people have any better solutions. One variation on the hl.df option that might be worth considering is hl.UseHighlightedFieldAsDefaultField, which would create a new Query object not just once at the start of highlighting, but separately for each particular field that's getting highlighted.
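As a rough illustration of that per-field variation, the Python sketch below parses the query once for each highlighted field, using that field as the default field. Nothing here comes from the patch: `parse_query` and `queries_per_highlighted_field` are hypothetical names, the returned tuple stands in for a Query object, "title" is an invented second field, and the field list is assumed to arrive as a comma-separated hl.fl value.

```python
def parse_query(query_text, default_field):
    # Hypothetical stand-in for a query parser: records which default field
    # (and therefore which analyzer) the query was parsed against.
    return (default_field, query_text)


def queries_per_highlighted_field(query_text, hl_fl):
    """Build one query per highlighted field, each parsed with that field as df."""
    fields = [name.strip() for name in hl_fl.split(",") if name.strip()]
    return {field: parse_query(query_text, field) for field in fields}


per_field = queries_per_highlighted_field('"audit trail"', "highlight, title")
```

Compared with a single hl.df value, this would cost one parse per highlighted field, but each field would be highlighted with a query built by its own analyzer.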

Attachments

SOLR-1910.patch (10 kB), attached by Chris Harris, 12/May/10 22:13

Issue Links

relates to

SOLR-937 Highlighting problem related to stemming (Reopened)

SOLR-456 Ability to choose another analyzer for field (Closed)

People

Assignee: Unassigned

Reporter: Chris Harris

Votes: 1

Watchers: 2

Dates

Created: 12/May/10 22:12

Updated: 30/Nov/13 14:14