Lucene - Core
  1. Lucene - Core
  2. LUCENE-2508

Consolidate Highlighter implementations and a major refactor of the non-termvector highlighter

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: modules/highlighter
    • Labels:
    • Environment:

      irrelevant

    • Lucene Fields:
      New, Patch Available

      Description

      Originally, I had planned to create a contrib module to allow people to highlight multiple documents in parallel, but after talking to Uwe in IRC about it, I realized that it was pretty useless. However, I was already sitting on an iterative highlighting algorithm that was much faster (my tests show 20% - 40%) and more accurate and, based on that same IRC conversation, I decided to not let all the work that I had done go to waste and try to contribute it back again. Uwe had mentioned that "More like this" detected term vectors when called and use the term vector implementation when possible, if I recall correctly, so I decided to do that.

      The patch that I've attached is my first stab at this. It's not nearly complete and full disclosure dictates that I say that it's not fully documented and there are not any unit tests written. I wanted to go ahead and open an issue to get some feedback on the approach that I've taken as well as the fact that it exists will be a proverbial kick in my pants to continue working on it.

      In short, what I've changed:

      • Completely rewritten the non-tv highlighter to be faster and cleaner. There is some small loss in functionality for now, namely the loss of the GradientHighlighter (I just haven't done this yet) and the lack of exposure of TermFragments and their scores (I can expose this if it is deemed necessary, this is one of the things I'd like feedback on).
      • Moved org.apache.lucene.search.vectorhighlight and org.apache.lucene.search.highlight to a single package with a unified interface, search.highlight (with two sub-packages: search.highlight.termvector and search.highlight.iterative, respectively).
      • Unified the highlighted term formatting into a single interface: highlighter/Formatter and both highlighters use this now.

      What I need to do before I personally would consider this finished:

      • Finish documentation, most specifically on TermVectorHighlighter. I haven't done this now as I expect things to change up quite a bit before they're finalized and I really hate writing documentation that goes to waste, but I do intend to complete this bullet
      • "Flesh out" the API of search.highlight.Highlighter as it's very barebones right now
      • Continue removing and consolidating duplicate functionality, like I've done with the highlighted word tag generation.

      What I think I need feedback on, before I can proceed:

      • FastTermVectorHighlighter and the iterative highlighters need completely different sets of information in order to work. The approach I've taken is exposing a vectorHighlight method in the unified interface and a iterativeHighlight method, as well as a single highlight method that takes all the information needed for either of them and I'm unsure if this is the best way to do this.
      • The naming of things; I'm not sure if this is a big issue, or even an issue at all, but I'd like to not break any conventions that may exist that I'm unaware of.
      • How big of a deal is exposing the particular score of a segment from the highlighting interface and does this need to be extended into the term vector highlighting as well?
      • There are a lot of methods in the tv implementation that are marked depracted; since this release will almost definitely break backwards compatibility anyway, are these safe to remove?
      • Any other input anyone else may have

      I'm going to continue to work on things that I can work on, at least unless someone tells me I'm wasting my time and will look forward to hearing you guys' feedback!

      As a sidenote because it does seem rather random that I would arbitrarily re-write a working algorithm in the non-tv highlighter, I did it originally because I wanted to parallelize the highlighting (which was a failed experiment) and simply to see if I could make the algorithm faster, as I find that sort of thing particularly fun

      As a second sidenote, if anyone would like an explanation of the algorithm for the highlighting I devised, and why I feel that it's more accurate, I'd be happy to provide them with one (and benchmarks as well).

      Thanks,
      Eddie

      1. LUCENE-2508.patch
        291 kB
        Edward Drapkin

        Activity

        Hide
        Edward Drapkin added a comment -

        The "first stab" patch.

        Show
        Edward Drapkin added a comment - The "first stab" patch.
        Hide
        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Show
        Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
        Hide
        Scott Stults added a comment -

        This is great! IIRC, one of the todo's in the FVH was to properly integrate it with the existing highlighter. One thing I'm wondering is whether this should be expanded to take in the postings highlighter, or make that integration a follow-on issue. (For one minor example, DefaultPassageFormatter has an HTML escape function that can be shared.)

        Show
        Scott Stults added a comment - This is great! IIRC, one of the todo's in the FVH was to properly integrate it with the existing highlighter. One thing I'm wondering is whether this should be expanded to take in the postings highlighter, or make that integration a follow-on issue. (For one minor example, DefaultPassageFormatter has an HTML escape function that can be shared.)
        Hide
        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.

        Show
        Uwe Schindler added a comment - Move issue to Lucene 4.9.

          People

          • Assignee:
            Unassigned
            Reporter:
            Edward Drapkin
          • Votes:
            2 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 336h
              336h
              Remaining:
              Remaining Estimate - 336h
              336h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development