  1. Solr
  2. SOLR-1910

Add hl.df (highlight-specific default field) param, so highlighting can have a separate analysis path from search


Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: highlighter
    • Labels: None

    Description

      Summary: Patch adds a hl.df parameter, to help with (some) situations where the highlighter currently uses the "wrong" analyzer for highlighting.

      What: hl.df is like the normal df parameter, except that it takes effect only during highlighting. (In fact, the implementation basically overrides the normal df parameter temporarily at the start of highlighting, then reverts it to the original value when highlighting is complete.) When hl.df is specified, we make sure not to use the Query object that was parsed by QueryComponent, but rather make our own. In the right circumstances, at least, this means that a more appropriate analyzer gets used for highlighting.
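
A rough sketch of the override-and-restore idea described above, in plain Java. This is not the code from the attached patch: the class name, the method name, and the use of a plain Map in place of Solr's request parameters are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class HlDfSketch {
    /**
     * Runs highlightStep with "df" temporarily replaced by the value of "hl.df",
     * then restores the original "df" (hypothetical helper, not a Solr API).
     */
    public static <T> T withHighlightDefaultField(Map<String, String> params,
                                                  Function<Map<String, String>, T> highlightStep) {
        String hlDf = params.get("hl.df");
        if (hlDf == null) {
            // No override requested: highlight with the parameters unchanged.
            return highlightStep.apply(params);
        }
        boolean hadDf = params.containsKey("df");
        String originalDf = params.get("df");
        params.put("df", hlDf);
        try {
            // A real highlighter would re-parse the query here instead of
            // reusing the Query object built for the search.
            return highlightStep.apply(params);
        } finally {
            // Restore the original state even if highlighting throws.
            if (hadDf) {
                params.put("df", originalDf);
            } else {
                params.remove("df");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("df", "body");
        params.put("hl.df", "highlight");
        String seenDuringHighlighting = withHighlightDefaultField(params, p -> p.get("df"));
        // prints "highlight / body"
        System.out.println(seenDuringHighlighting + " / " + params.get("df"));
    }
}
```

The try/finally is there so that the search-time default field comes back even when the highlighting step fails.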

      Motivation: Currently, in a normal query+highlighting request, the highlighter re-uses the Query object parsed by the QueryComponent. This can result in incorrect highlights if the field being highlighted is of a different type than the field being queried. In my particular case:

      • My queries don't explicitly specify field names; they always rely on the default field
      • My default field for search is "body"
      • body is a unigram-plus-bigram field. So, e.g., the input "audit trail" gets turned into the tokens "audit / audit trail / trail". (This is a performance optimization.)
      • If I try to highlight directly on "body", the highlights get screwed up. (This is because the highlighter doesn't really support the kind of "continuously overlapping" tokens generated by my analysis chain. In short, the bigrams confuse the TokenGroup class.)
      • To avoid these highlighting problems, I don't directly highlight "body", but rather a "highlight" field, which has no bigram tokens. ("highlight" is populated from "body" with a copyField directive.)
      • Without hl.df, I have a new class of highlighting problems. In particular, if the user enters a phrase search (e.g. "audit trail"), then that phrase appears unhighlighted in the highlighter output. The short version of why is that the analyzer used to parse the query produces a Query object that contains bigrams, but the text being highlighted doesn't contain bigrams.
      • With hl.df, the analyzers match up for highlighting; the Query object used for highlighting does not contain bigrams, just like the "highlight" field.
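
The mismatch in the bullets above can be illustrated with a toy model in Java. This is a deliberate simplification: the real highlighter works on Query objects and token positions, not on token sets, and the whitespace tokenizer and all class and method names here are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BigramMismatchSketch {
    /** Lowercased whitespace-split words, standing in for the "highlight" field's analysis. */
    public static List<String> unigrams(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }

    /** Words interleaved with adjacent-word bigrams, standing in for the "body" field's analysis. */
    public static List<String> unigramsPlusBigrams(String text) {
        List<String> words = unigrams(text);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            tokens.add(words.get(i));
            if (i + 1 < words.size()) {
                tokens.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return tokens;
    }

    /** True if every query token also occurs among the field's tokens. */
    public static boolean allTermsPresent(List<String> queryTokens, List<String> fieldTokens) {
        return fieldTokens.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        // Tokens of the text being highlighted (no bigrams).
        List<String> highlightFieldTokens = unigrams("The audit trail was complete");

        // Query analyzed as for "body": [audit, audit trail, trail]
        List<String> queryViaBody = unigramsPlusBigrams("audit trail");
        // Query analyzed as for "highlight": [audit, trail]
        List<String> queryViaHighlight = unigrams("audit trail");

        System.out.println(allTermsPresent(queryViaBody, highlightFieldTokens));      // prints false
        System.out.println(allTermsPresent(queryViaHighlight, highlightFieldTokens)); // prints true
    }
}
```

In this model, the "audit trail" bigram produced by the search-side analysis has no counterpart in the highlighted text, which is the situation hl.df is meant to avoid.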

      (I realize it may help to expand the description of this use case, but I'm a bit hurried right now.)

      I wanted to throw this out there, partly in case people have any better solutions. One variation on the hl.df option that might be worth considering is hl.UseHighlightedFieldAsDefaultField, which would create a new Query object not just once at the start of highlighting, but separately for each particular field that's getting highlighted.
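
As a rough sketch of that per-field variation, the Java below parses the query text once per highlighted field, each time with that field as the default. Nothing here comes from the patch: the class name, the parser callback, and the "title" field used in the example are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class PerFieldQuerySketch {
    /**
     * Builds one parsed query per highlighted field. The parser callback takes
     * (queryText, defaultField) and stands in for a real query parser.
     */
    public static <Q> Map<String, Q> parsePerField(String queryText,
                                                   List<String> highlightFields,
                                                   BiFunction<String, String, Q> parser) {
        Map<String, Q> perField = new LinkedHashMap<>();
        for (String field : highlightFields) {
            // Each field gets a query analyzed with its own analysis chain.
            perField.put(field, parser.apply(queryText, field));
        }
        return perField;
    }

    public static void main(String[] args) {
        // Toy parser: just records which default field was used.
        Map<String, String> queries = parsePerField(
                "audit trail",
                Arrays.asList("title", "highlight"),
                (text, defaultField) -> defaultField + ":(" + text + ")");
        System.out.println(queries);
    }
}
```

The trade-off compared with a single hl.df is one extra parse per highlighted field in exchange for analysis that matches every field.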

      Attachments

        1. SOLR-1910.patch
          10 kB
          Chris Harris

            People

              Assignee: Unassigned
              Reporter: Chris Harris (ryguasu)
              Votes: 1
              Watchers: 2
