  1. Lucene - Core
  2. LUCENE-9461

Query hit highlighting components on top of matches API

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Highlighters. Eventually, you'll have to face them.

      When a Lucene Query is run over an index, it implies a list of documents that "matched it" - literally a boolean indication of whether the document should be included in the search result or not. In practice, many applications need to convey to users not just the fact that a document matched the query but also some sort of intuitive explanation of why this particular query matched it. While in many cases the relationship is trivial (term containment), in the case of complex queries it may not be trivial at all (think of a really short prefix query, a fuzzy term query or even a Boolean disjunction with a high number of possibilities).

      Historically, search engines used to "highlight" the source area of a document that caused the "hit". If a document was too long, it was truncated and only the area around the hit (or hits) was displayed (a so-called "snippet").

      In my subjective opinion, highlighters have played a secondary role to queries and search in the Lucene API. Yet once you try to build something higher-level, highlighters become a crucial and necessary element of the entire system.

      My experience (and user feedback) from implementing a document retrieval system that involved highlighting was that it just didn't work as expected. Here are the requirements of that system:

      • the query parser uses default field expansion into multiple fields (there is no single "sink" field),
      • the highlights should match exactly what caused the hit; a search for 'title:foo' must not highlight foo in any other field,
      • the set of fields to be highlighted isn't really fixed - there are some fields that should always be displayed (title, summary) and others that should not be displayed unless they're part of the query (in which case the highlight is important and should be shown to the user),
      • highlights should be accurate for all sorts of queries: fuzzy, phrase, prefix, Boolean, spans, etc.,
      • there can be more than one query at one time and they should highlight the same content (with different colors).

      Many highlighters are available in Lucene (vector highlighter, postings highlighter, unified highlighter) but none of them quite fits the bill above. Believe me - we have tried (hard). We ended up using the unified highlighter but with subclassing, customizations and all sorts of complex, low-level quirks.

      My gut feeling at that point was that it should be the Query that somehow exposes the information about how a given field's content matched. Then I looked at the matches API and built a quick prototype retrieving "match regions" on top of that. It works like magic. Here are the key insights:

      • the matches API returns exactly what a highlighter needs: for a given query it iterates over the fields and positions (including offsets, if they are available) that caused a document to be included in the search result,
      • when the matches API cannot provide offsets, it provides elements from which offsets can be computed: for example, positions, which can be mapped to offsets by re-analyzing the field's value,
      • in extreme cases the matches API may not provide anything useful (a field that is only indexed, with no stored field value, no positions, no offsets) but I assume it is up to the application layer to know how to deal with this (or not deal with it at all and throw an exception),
      • the matches API delegates the work of providing proper match ranges to the query itself (actually, to the weight a query produces); it doesn't need to know anything about different implementations and their specifics.
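
      The second point above (recovering character offsets from token positions by re-analyzing the field value) can be illustrated with a minimal sketch. It deliberately avoids Lucene classes: a trivial whitespace tokenizer stands in for a real Analyzer, and the class and method names (PositionsToOffsets, tokenOffsets, toOffsetRange) are hypothetical, not part of this issue's code. Positions are assumed to be zero-based and the end position inclusive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): maps token positions to character offsets. */
class PositionsToOffsets {
    /**
     * Re-tokenizes the value on whitespace (an assumption standing in for a real analyzer)
     * and returns, for each token position, a pair {startOffset, endOffset} with an
     * exclusive end offset.
     */
    static List<int[]> tokenOffsets(String value) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            // Skip whitespace between tokens.
            while (i < value.length() && Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < value.length() && !Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets;
    }

    /** Converts a match spanning positions startPosition..endPosition (inclusive) to offsets. */
    static int[] toOffsetRange(List<int[]> offsets, int startPosition, int endPosition) {
        return new int[] {offsets.get(startPosition)[0], offsets.get(endPosition)[1]};
    }

    public static void main(String[] args) {
        String value = "the quick brown fox";
        int[] range = toOffsetRange(tokenOffsets(value), 1, 2);
        System.out.println(value.substring(range[0], range[1])); // prints "quick brown"
    }
}
```

      In a real pipeline the tokenization would have to come from the same analysis chain that was used at index time, otherwise positions and offsets would not line up.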

      The absolute key element is the last one. Once you build a match region retriever, highlighting is merely about organizing match ranges, dealing with potential overlaps, and proper formatting. It becomes a simple, tractable problem separated from the internals of Lucene Queries.
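
      To make "dealing with potential overlaps" concrete, here is a minimal sketch of one possible policy: merging overlapping or touching character ranges into a single range. This is an illustration only, not the code attached to this issue; the class name MatchRanges and the representation of a range as an int pair {start, end} with an exclusive end are assumptions. Other policies (for example keeping the longest range, or keeping ranges separate per query) are equally plausible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not a Lucene class) for cleaning up match ranges. */
class MatchRanges {
    /**
     * Merges overlapping or touching ranges. Each range is {start, end} with an
     * exclusive end. The input list and its arrays are left unmodified.
     */
    static List<int[]> mergeOverlaps(List<int[]> ranges) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] r : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1]) {
                // Overlaps (or touches) the previous range: extend it.
                last[1] = Math.max(last[1], r[1]);
            } else {
                // Disjoint from everything so far: start a new range (copied).
                merged.add(new int[] {r[0], r[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> input = List.of(
            new int[] {10, 15}, new int[] {0, 5}, new int[] {3, 8}, new int[] {15, 20});
        for (int[] r : mergeOverlaps(input)) {
            System.out.println(r[0] + "-" + r[1]); // prints "0-8" then "10-20"
        }
    }
}
```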

      The initial set of "highlighter components" in this issue is a set of classes that allows one to assemble a complete pipeline from any query into a set of highlighted document fields. Any highlighter can essentially be built by assembling the following steps:

      • retrieving documents and their fields/match ranges, given [Query, IndexSearcher],
      • sanitizing match ranges (overlaps, etc.),
      • selecting the "best" snippet for the given set of match ranges,
      • formatting the output (adding start/end tags for snippets, ellipsis between values, etc.).
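
      The last two steps above (snippet selection and formatting) can be sketched as follows. This is a simplified illustration, not the component design from this issue: the class name SnippetFormatter, the fixed-size character window, the "most matches wins" scoring, and the <b> tags and "..." ellipsis are all assumptions. The ranges are assumed to be sorted, non-overlapping {start, end} pairs with an exclusive end.

```java
import java.util.List;

/** Hypothetical helper (not a Lucene class): picks one snippet and marks matches in it. */
class SnippetFormatter {
    /**
     * Returns the start of the window (anchored at some match start) that fully
     * contains the largest number of match ranges; 0 if there are no ranges.
     */
    static int bestWindowStart(List<int[]> ranges, int windowSize) {
        int bestStart = 0;
        int bestCount = -1;
        for (int[] candidate : ranges) {
            int count = 0;
            for (int[] r : ranges) {
                if (r[0] >= candidate[0] && r[1] <= candidate[0] + windowSize) {
                    count++;
                }
            }
            if (count > bestCount) {
                bestCount = count;
                bestStart = candidate[0];
            }
        }
        return bestStart;
    }

    /** Formats the best window, wrapping matches in tags and adding ellipses where text was cut. */
    static String format(String value, List<int[]> ranges, int windowSize) {
        int from = bestWindowStart(ranges, windowSize);
        int to = Math.min(value.length(), from + windowSize);
        StringBuilder sb = new StringBuilder(from > 0 ? "..." : "");
        int cursor = from;
        for (int[] r : ranges) {
            if (r[0] < from || r[1] > to) {
                continue; // Match lies (partly) outside the selected window.
            }
            sb.append(value, cursor, r[0]);
            sb.append("<b>").append(value, r[0], r[1]).append("</b>");
            cursor = r[1];
        }
        sb.append(value, cursor, to);
        if (to < value.length()) {
            sb.append("...");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "one foo two bar three foo four";
        List<int[]> ranges = List.of(new int[] {4, 7}, new int[] {22, 25}, new int[] {26, 30});
        System.out.println(format(value, ranges, 10)); // prints "...<b>foo</b> <b>four</b>"
    }
}
```

      Keeping selection and formatting as separate functions mirrors the idea of small, replaceable pieces: a different scoring rule or output format can be swapped in without touching match retrieval.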

      This issue implements components for all of the above steps. It isn't about one highlighter class with tons of options; it's about bits and pieces that can be put together to build anything one desires. That said, an example "high level" highlighter class will also be provided as a sub-task.

            People

              dweiss Dawid Weiss

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h