Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.2
    • Fix Version/s: 6.3
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The UnifiedHighlighter is an evolution of the PostingsHighlighter that is able to highlight using offsets in either postings, term vectors, or from analysis (a TokenStream). Lucene’s existing highlighters are mostly demarcated along offset source lines, whereas here it is unified – hence this proposed name. In this highlighter, the offset source strategy is separated from the core highlighting functionalty. The UnifiedHighlighter further improves on the PostingsHighlighter’s design by supporting accurate phrase highlighting using an approach similar to the standard highlighter’s WeightedSpanTermExtractor. The next major improvement is a hybrid offset source strategythat utilizes postings and “light” term vectors (i.e. just the terms) for highlighting multi-term queries (wildcards) without resorting to analysis. Phrase highlighting and wildcard highlighting can both be disabled if you’d rather highlight a little faster albeit not as accurately reflecting the query.
      We’ve benchmarked an earlier version of this highlighter comparing it to the other highlighters and the results were exciting! It’s tempting to share those results but it’s definitely due for another benchmark, so we’ll work on that. Performance was the main motivator for creating the UnifiedHighlighter, as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy requirements) wasn’t fast enough, even with term vectors along with several improvements we contributed back, and even after we forked it to highlight in multiple threads.

      1. LUCENE-7438.patch
        393 kB
        David Smiley
      2. LUCENE_7438_UH_small_changes.patch
        9 kB
        David Smiley
      3. LUCENE_7438_UH_benchmark.patch
        28 kB
        David Smiley
      4. LUCENE_7438_UH_benchmark.patch
        58 kB
        David Smiley

        Issue Links

          Activity

          Hide
          Timothy055 Timothy M. Rodriguez added a comment - - edited

          (dsmiley: edited formatting only)

          Some additional information:

          Missing features & possible future improvements:

          Despite the offset source flexibility and accuracy options of this highlighter, it continues to be the case that some highlighters have unique features. The following features are in the standard Highlighter (and possibly FastVectorHighlighter) but are not in the UnifiedHighlighter (and thus not PostingsHighlighter either since UH is derived from PH):

          • Being able to disable “requireFieldMatch” to thus highlight a query insensitive to whatever fields are mentioned in the query.
          • Using boosts in the query to weight passages.
          • Regex pased passage delineation. Though I’m unsure if anyone cares given the existing BreakIterator options available.

          Aside from addressing the feature gaps listed above, there are a couple known things that would be nice to add:

          • The phrase highlighting (implemented by PhraseHelper) could be made more accurate, and probably faster too, by using techniques in Alan Woodward's Luwak system that uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t done this way to begin with because this highlighter was developed originally for Lucene 4.10.
          • Wildcard queries usually use TokenStreamFromTermVector, which uninverts the terms out of a Terms index. Instead, we now think it would be better to create a bunch of PostingsEnum for each matching term. This would bring about some simplifications and efficiencies, and can lead to better passage relevancy. A bonus would be aggregating terms matching the same automata into a merged PostingsEnum that has a freq() based on the sum of the underlying matching terms.

          Changes from the PostingsHighlighter

          • The UH is more stateful
            • Holds the IndexSearcher instead of asking most methods to pass it through.
            • Options now have simple setters, and the per-field getters return these. This means the common case of a setting being non-specific to a field doesn’t require subclassing.
          • Multi-valued field handling is improved to ensure that a passage will never span across values, plus it honors the positionIncrementGap for an analyzed offset source. See MultiValueTokenStream and SplittingBreakIterator.
          • The PH caches all content to be highlighted for all docs and then highlights it all. The UH has a limit on this which led to a batching approach. But if all fields use an Analyzer or if more than one use term vectors, then instead highlighting happens one doc at a time since the up-front content caching is not helpful.
          • No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one doc to the next. This really simplified some code; it didn’t seem worth it.
          • MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it to guard against ramifications of exceptions being thrown during highlighting (e.g. a BreakIterator bug or TokenStream bug). Nasty to debug!
          • (from standard Highlighter) TokenStreamFromTermVector: optimizations to uninvert filtered (thus sparse) Terms.

          Non-Core Dependencies

          • MemoryIndex: For Analyzer based highlighting when phrases need to be highlighted accurately.
          • Standard Highlighter things:
            • TokenStreamFromTermVector: For most multi-term queries. The UH actually has its own derived copy that has been optimized to handle filtered (thus sparse) Terms. With further work, we could switch to a different approach and remove it (as indicated earlier). For as long as it stays, it’s also possible to replace the existing one with this if we want to do that.
            • WeightedSpanTermExtractor: For highlighting phrases accurately to re-use it’s SpanQuery conversion and rewrite detecting abilities. Perhaps these parts of WSTE could move to general SpanQuery utilities.
            • TermVectorLeafReader: When highlighting offsets from term vectors.
          • PostingHighlighter things:
            • Technically, Nothing however it has multiple copies of some things that have not been modified: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter.
            • Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: WholeBreakIterator, CustomSeparatorBreakIterator. Maybe they should move to a utils package that isn’t in any of these highlighters?
          Show
          Timothy055 Timothy M. Rodriguez added a comment - - edited (dsmiley: edited formatting only) Some additional information: Missing features & possible future improvements: Despite the offset source flexibility and accuracy options of this highlighter, it continues to be the case that some highlighters have unique features. The following features are in the standard Highlighter (and possibly FastVectorHighlighter) but are not in the UnifiedHighlighter (and thus not PostingsHighlighter either since UH is derived from PH): Being able to disable “requireFieldMatch” to thus highlight a query insensitive to whatever fields are mentioned in the query. Using boosts in the query to weight passages. Regex pased passage delineation. Though I’m unsure if anyone cares given the existing BreakIterator options available. Aside from addressing the feature gaps listed above, there are a couple known things that would be nice to add: The phrase highlighting (implemented by PhraseHelper) could be made more accurate, and probably faster too, by using techniques in Alan Woodward 's Luwak system that uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t done this way to begin with because this highlighter was developed originally for Lucene 4.10. Wildcard queries usually use TokenStreamFromTermVector, which uninverts the terms out of a Terms index. Instead, we now think it would be better to create a bunch of PostingsEnum for each matching term. This would bring about some simplifications and efficiencies, and can lead to better passage relevancy. A bonus would be aggregating terms matching the same automata into a merged PostingsEnum that has a freq() based on the sum of the underlying matching terms. Changes from the PostingsHighlighter The UH is more stateful Holds the IndexSearcher instead of asking most methods to pass it through. Options now have simple setters, and the per-field getters return these. This means the common case of a setting being non-specific to a field doesn’t require subclassing. Multi-valued field handling is improved to ensure that a passage will never span across values, plus it honors the positionIncrementGap for an analyzed offset source. See MultiValueTokenStream and SplittingBreakIterator. The PH caches all content to be highlighted for all docs and then highlights it all. The UH has a limit on this which led to a batching approach. But if all fields use an Analyzer or if more than one use term vectors, then instead highlighting happens one doc at a time since the up-front content caching is not helpful. No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one doc to the next. This really simplified some code; it didn’t seem worth it. MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it to guard against ramifications of exceptions being thrown during highlighting (e.g. a BreakIterator bug or TokenStream bug). Nasty to debug! (from standard Highlighter) TokenStreamFromTermVector: optimizations to uninvert filtered (thus sparse) Terms. Non-Core Dependencies MemoryIndex: For Analyzer based highlighting when phrases need to be highlighted accurately. Standard Highlighter things: TokenStreamFromTermVector: For most multi-term queries. The UH actually has its own derived copy that has been optimized to handle filtered (thus sparse) Terms. With further work, we could switch to a different approach and remove it (as indicated earlier). For as long as it stays, it’s also possible to replace the existing one with this if we want to do that. WeightedSpanTermExtractor: For highlighting phrases accurately to re-use it’s SpanQuery conversion and rewrite detecting abilities. Perhaps these parts of WSTE could move to general SpanQuery utilities. TermVectorLeafReader: When highlighting offsets from term vectors. PostingHighlighter things: Technically, Nothing however it has multiple copies of some things that have not been modified: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter. Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: WholeBreakIterator, CustomSeparatorBreakIterator. Maybe they should move to a utils package that isn’t in any of these highlighters?
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user Timothy055 opened a pull request:

          https://github.com/apache/lucene-solr/pull/79

          LUCENE-7438 UnifiedHighlighter

          Initial pull request for LUCENE-7438(https://issues.apache.org/jira/browse/LUCENE-7438)

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/Timothy055/lucene-solr master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/79.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #79


          commit 02e932c4a6146363680b88f4947a693c6697c955
          Author: Timothy Rodriguez <trodriguez25@bloomberg.net>
          Date: 2016-09-01T19:23:50Z

          Initial fork of PostingsHighlighter for UnifiedHighlighter

          commit 9d88411b3985a98851384d78d681431dba710e89
          Author: Timothy Rodriguez <trodriguez25@bloomberg.net>
          Date: 2016-09-01T23:17:06Z

          Initial commit of the UnifiedHighlighter for OSS contribution

          commit e45e39bc4b07ea33e4423b264c2fefb9aa08777a
          Author: David Smiley <david.w.smiley@gmail.com>
          Date: 2016-09-02T12:45:49Z

          Fix misc issues; "ant test" now works. (#1)

          commit 046a28ef31acf4cea7d255bbbb4b827e6a714e3d
          Author: Timothy Rodriguez <trodriguez25@bloomberg.net>
          Date: 2016-09-02T20:58:31Z

          Minor refactoring of the AnalysisFieldHighlighter

          commit ccd1a2280abd4b48cfef8122696e5d9cfd12920f
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-03T12:55:20Z

          AbstractFieldHighlighter: order methods more sensibly; renamed a couple.

          commit d4714a04a3e41d5e95bbe942b275c32ed69b9c2e
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T01:03:29Z

          Improve javadocs and @lucene.external/internal labeling & scope.
          "ant precommit" now passes.

          commit e0659f18a59bf2893076da6d7643ff30f2fa5a52
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T01:25:55Z

          Analysis: remove dubious filter() method

          commit ccd7ce707bff2c06da89b31853cca9aecea72008
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T01:44:01Z

          getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and only call filterExtractedTerms once.

          commit ffc2a22c700b8abcbf87673d5d05bb3659d177c9
          Author: David Smiley <david.w.smiley@gmail.com>
          Date: 2016-09-04T15:21:08Z

          UnifiedHighlighter round 2 (#2)

          • AbstractFieldHighlighter: order methods more sensibly; renamed a couple.
          • Improve javadocs and @lucene.external/internal labeling & scope.
            "ant precommit" now passes.
          • Analysis: remove dubious filter() method
          • getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and only call filterExtractedTerms once.

          commit 5f95e05595db462d3ab5bffc68c2c92f70875072
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T16:12:33Z

          Refactor: FieldOffsetStrategy

          commit 86fb6265fbbdb955ead6d4baf944bf708175715e
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T16:21:32Z

          stop passing maxPassages into highlightFieldForDoc()

          commit f6fd80544eae9fab953b94b1e9346c0883f956eb
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T16:12:33Z

          Refactor: FieldOffsetStrategy

          commit b335a673c2ce45904890c1e9af7cbfda2bd27b0f
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T16:21:32Z

          stop passing maxPassages into highlightFieldForDoc()

          commit 478db9437b92214cbf459f82ba2e3a67c966a150
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T18:29:44Z

          Rename subclasses of FieldOffsetStrategy.

          commit dbf4280755c11420a5032445cd618fadb7444b61
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T18:31:34Z

          Re-order and harmonize params on methods called by UH.getFieldHighlighter()

          commit f0340e27e61dcda2e11992f08ec07a72fad6c24c
          Author: David Smiley <dsmiley@apache.org>
          Date: 2016-09-04T18:53:51Z

          FieldHighlighter: harmonize field/param order. And don't apply maxNoHighlightPasses twice.

          commit 817f63c1d48fd523c13b9c40a2ae9b8a4047209a
          Author: Timothy Rodriguez <trodriguez25@bloomberg.net>
          Date: 2016-09-06T20:43:20Z

          Merge of renaming changes

          commit 0f644a4f53c1ed4d41d562848f6fe51a87442a75
          Author: Timothy Rodriguez <trodriguez25@bloomberg.net>
          Date: 2016-09-06T20:54:13Z

          add visibility tests

          commit 9171f49e117085e7d086267bb73836831ff07f8e
          Author: Timothy Rodriguez <trodriguez25@bloomberg.net>
          Date: 2016-09-07T14:26:59Z

          ADd additional extensibility test


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user Timothy055 opened a pull request: https://github.com/apache/lucene-solr/pull/79 LUCENE-7438 UnifiedHighlighter Initial pull request for LUCENE-7438 ( https://issues.apache.org/jira/browse/LUCENE-7438 ) You can merge this pull request into a Git repository by running: $ git pull https://github.com/Timothy055/lucene-solr master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/79.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #79 commit 02e932c4a6146363680b88f4947a693c6697c955 Author: Timothy Rodriguez <trodriguez25@bloomberg.net> Date: 2016-09-01T19:23:50Z Initial fork of PostingsHighlighter for UnifiedHighlighter commit 9d88411b3985a98851384d78d681431dba710e89 Author: Timothy Rodriguez <trodriguez25@bloomberg.net> Date: 2016-09-01T23:17:06Z Initial commit of the UnifiedHighlighter for OSS contribution commit e45e39bc4b07ea33e4423b264c2fefb9aa08777a Author: David Smiley <david.w.smiley@gmail.com> Date: 2016-09-02T12:45:49Z Fix misc issues; "ant test" now works. (#1) commit 046a28ef31acf4cea7d255bbbb4b827e6a714e3d Author: Timothy Rodriguez <trodriguez25@bloomberg.net> Date: 2016-09-02T20:58:31Z Minor refactoring of the AnalysisFieldHighlighter commit ccd1a2280abd4b48cfef8122696e5d9cfd12920f Author: David Smiley <dsmiley@apache.org> Date: 2016-09-03T12:55:20Z AbstractFieldHighlighter: order methods more sensibly; renamed a couple. commit d4714a04a3e41d5e95bbe942b275c32ed69b9c2e Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T01:03:29Z Improve javadocs and @lucene.external/internal labeling & scope. "ant precommit" now passes. commit e0659f18a59bf2893076da6d7643ff30f2fa5a52 Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T01:25:55Z Analysis: remove dubious filter() method commit ccd7ce707bff2c06da89b31853cca9aecea72008 Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T01:44:01Z getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and only call filterExtractedTerms once. commit ffc2a22c700b8abcbf87673d5d05bb3659d177c9 Author: David Smiley <david.w.smiley@gmail.com> Date: 2016-09-04T15:21:08Z UnifiedHighlighter round 2 (#2) AbstractFieldHighlighter: order methods more sensibly; renamed a couple. Improve javadocs and @lucene.external/internal labeling & scope. "ant precommit" now passes. Analysis: remove dubious filter() method getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and only call filterExtractedTerms once. commit 5f95e05595db462d3ab5bffc68c2c92f70875072 Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T16:12:33Z Refactor: FieldOffsetStrategy commit 86fb6265fbbdb955ead6d4baf944bf708175715e Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T16:21:32Z stop passing maxPassages into highlightFieldForDoc() commit f6fd80544eae9fab953b94b1e9346c0883f956eb Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T16:12:33Z Refactor: FieldOffsetStrategy commit b335a673c2ce45904890c1e9af7cbfda2bd27b0f Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T16:21:32Z stop passing maxPassages into highlightFieldForDoc() commit 478db9437b92214cbf459f82ba2e3a67c966a150 Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T18:29:44Z Rename subclasses of FieldOffsetStrategy. commit dbf4280755c11420a5032445cd618fadb7444b61 Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T18:31:34Z Re-order and harmonize params on methods called by UH.getFieldHighlighter() commit f0340e27e61dcda2e11992f08ec07a72fad6c24c Author: David Smiley <dsmiley@apache.org> Date: 2016-09-04T18:53:51Z FieldHighlighter: harmonize field/param order. And don't apply maxNoHighlightPasses twice. commit 817f63c1d48fd523c13b9c40a2ae9b8a4047209a Author: Timothy Rodriguez <trodriguez25@bloomberg.net> Date: 2016-09-06T20:43:20Z Merge of renaming changes commit 0f644a4f53c1ed4d41d562848f6fe51a87442a75 Author: Timothy Rodriguez <trodriguez25@bloomberg.net> Date: 2016-09-06T20:54:13Z add visibility tests commit 9171f49e117085e7d086267bb73836831ff07f8e Author: Timothy Rodriguez <trodriguez25@bloomberg.net> Date: 2016-09-07T14:26:59Z ADd additional extensibility test
          Hide
          Timothy055 Timothy M. Rodriguez added a comment - - edited

          Pull request here: https://github.com/apache/lucene-solr/pull/79

          I'd also like to specially acknowledge David Smiley who has worked with us closely. He did the lion's share of the work represented here. (Including the genesis of the idea for unifying the disparate highlighters.)

          Show
          Timothy055 Timothy M. Rodriguez added a comment - - edited Pull request here: https://github.com/apache/lucene-solr/pull/79 I'd also like to specially acknowledge David Smiley who has worked with us closely. He did the lion's share of the work represented here. (Including the genesis of the idea for unifying the disparate highlighters.)
          Hide
          dsmiley David Smiley added a comment -

          I think we can avoid some duplication confusion as follows:

          For the internal classes that user's don't normally use:

          • MultiTermHighlighting: transfer most of the changes I did in MultiTermHighlighting to the copy in the postingshighlight package – particularly to anything that already existed there. Then make that public and lucene.internal so it can be accessed. That is very low-impact on the PH. For the couple methods added – uninvertAndFilterTerms and makeStringMatchAutomata I think we can add these to FieldOffsetStrategy and AnalysisOffsetStrategy respectively. And add comments mentioning it would logically go in MTH but since that's in a different highlighter, we don't.
          • TokenStreamFromTermVector: I think we can replace the one in the highlighter with this one, as the sparseness ratio is configurable in the constructor.

          For the surface classes users use: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter. – I don't think it good to have users use parts of another highlighter (postingshighlight), which is weird for users. I propose copying these with a leading 'U', i.e. UPassage etc. That said if others think that's a worse trade-off, it's no big deal to me. Once o.a.l.s.ph.Passage's constructor is public, it's possible to do that.

          RE benchmarks... not sure when we'll have those ready but I would hope by the end of this month. I figure using our benchmark module on wikipedia is a fine way to go. I've used that to benchmark enhancements to the standard highlighter before.

          Thoughts (esp. from other committers)? Robert Muir, I figure you'll have some valuable feedback as you did most (all?) of the herculean work on the PostingsHighlighter which was an ideal starting point for this UH. I know some folks are on vacation or at another conference right now who I know want to provide feedback so I'm in no hurry to commit anything.

          Show
          dsmiley David Smiley added a comment - I think we can avoid some duplication confusion as follows: For the internal classes that user's don't normally use: MultiTermHighlighting : transfer most of the changes I did in MultiTermHighlighting to the copy in the postingshighlight package – particularly to anything that already existed there. Then make that public and lucene.internal so it can be accessed. That is very low-impact on the PH. For the couple methods added – uninvertAndFilterTerms and makeStringMatchAutomata I think we can add these to FieldOffsetStrategy and AnalysisOffsetStrategy respectively. And add comments mentioning it would logically go in MTH but since that's in a different highlighter, we don't. TokenStreamFromTermVector : I think we can replace the one in the highlighter with this one, as the sparseness ratio is configurable in the constructor. For the surface classes users use: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter. – I don't think it good to have users use parts of another highlighter ( postingshighlight ), which is weird for users. I propose copying these with a leading 'U', i.e. UPassage etc. That said if others think that's a worse trade-off, it's no big deal to me. Once o.a.l.s.ph.Passage 's constructor is public, it's possible to do that. RE benchmarks... not sure when we'll have those ready but I would hope by the end of this month. I figure using our benchmark module on wikipedia is a fine way to go. I've used that to benchmark enhancements to the standard highlighter before. Thoughts (esp. from other committers)? Robert Muir , I figure you'll have some valuable feedback as you did most (all?) of the herculean work on the PostingsHighlighter which was an ideal starting point for this UH. I know some folks are on vacation or at another conference right now who I know want to provide feedback so I'm in no hurry to commit anything.
          Hide
          Timothy055 Timothy M. Rodriguez added a comment -

          I'm not a fan of forking classes with Uxyz naming scheme. I think it'd be better to make the existing classes re-usable or keep the current naming scheme. That being said, if we make the existing classes re-usable, it might be better to plan on moving them into some common package later on so it's clearer that they are re-used.

          Show
          Timothy055 Timothy M. Rodriguez added a comment - I'm not a fan of forking classes with Uxyz naming scheme. I think it'd be better to make the existing classes re-usable or keep the current naming scheme. That being said, if we make the existing classes re-usable, it might be better to plan on moving them into some common package later on so it's clearer that they are re-used.
          Hide
          rcmuir Robert Muir added a comment -

          I think there are room for plenty of new highlighters in lucene, so its great to have another one. I do get the feeling from some issues etc, that some feel there "can be only one", but I don't see any good reasons for that. On the other hand, like codecs and other things in lucene, we should explore different approaches that give the user more choices (like this new highlighter here).

          I think this is especially important because of how "personal" highlighting is to the app, and the fact that performance/relevance is tricky stuff here depending on how the app works! For example about the reuse note: this highlighter discards reuse of some internal lucene structures, but under some circumstances (e.g. certain query structures/Directory impl/doc sizes/top-N sizes/stopwords or lack thereof) this could indeed matter a lot. For PH it does this simply because it tries to maximize perf everywhere (possibly to the extreme: perhaps it really is the wrong tradeoff, but that was a "different" direction to explore). There are lots of ways these things can perform or be very slow, and a lot of it is hard to generalize across all use-cases!

          As far as the duplication of classes, I'd be a little careful before refactoring too much of it, because of that very reason. Maybe UH needs to ultimately go in different directions than PH and we should just let it do that.

          For example ranking: PH disregards query structure and tries to use a bag-of-words approach with something similar to traditional ranking for that, the idea is that hopefully that stuff works well on a small scale too.

          But UH might need something else: if it attempts to use more query structure than bag-of-words, then UH might need to do something else. I haven't looked to see how things like IDF are computed there, that's just an example.

          And maybe the right direction for PH, given what it tries to do, is to do something like LUCENE-4909... that sorta sits out there because we don't have a good way of measuring quality?

          I also do worry a bit about making internal-only classes like MultitermHighlighting public to the user, I think this has a real heavy API cost. Maybe for that one in particular its the right way to go, given how hacky/hairy it is especially. Maybe it can be renamed to something better to limit the confusion This problem isn't really unique to highlighters though, its something that should be addressed better in general, e.g. with internal-only packages that are hidden or something like that.

          Anyway, these are just some general thoughts. Glad to see we will have more choices.

          Show
          rcmuir Robert Muir added a comment - I think there are room for plenty of new highlighters in lucene, so its great to have another one. I do get the feeling from some issues etc, that some feel there "can be only one", but I don't see any good reasons for that. On the other hand, like codecs and other things in lucene, we should explore different approaches that give the user more choices (like this new highlighter here). I think this is especially important because of how "personal" highlighting is to the app, and the fact that performance/relevance is tricky stuff here depending on how the app works! For example about the reuse note: this highlighter discards reuse of some internal lucene structures, but under some circumstances (e.g. certain query structures/Directory impl/doc sizes/top-N sizes/stopwords or lack thereof) this could indeed matter a lot. For PH it does this simply because it tries to maximize perf everywhere (possibly to the extreme: perhaps it really is the wrong tradeoff, but that was a "different" direction to explore). There are lots of ways these things can perform or be very slow, and a lot of it is hard to generalize across all use-cases! As far as the duplication of classes, I'd be a little careful before refactoring too much of it, because of that very reason. Maybe UH needs to ultimately go in different directions than PH and we should just let it do that. For example ranking: PH disregards query structure and tries to use a bag-of-words approach with something similar to traditional ranking for that, the idea is that hopefully that stuff works well on a small scale too. But UH might need something else: if it attempts to use more query structure than bag-of-words, then UH might need to do something else. I haven't looked to see how things like IDF are computed there, that's just an example. And maybe the right direction for PH, given what it tries to do, is to do something like LUCENE-4909 ... that sorta sits out there because we don't have a good way of measuring quality? I also do worry a bit about making internal-only classes like MultitermHighlighting public to the user, I think this has a real heavy API cost. Maybe for that one in particular its the right way to go, given how hacky/hairy it is especially. Maybe it can be renamed to something better to limit the confusion This problem isn't really unique to highlighters though, its something that should be addressed better in general, e.g. with internal-only packages that are hidden or something like that. Anyway, these are just some general thoughts. Glad to see we will have more choices.
          Hide
          dsmiley David Smiley added a comment -

          Thanks for chiming in Rob.

          For the present while there are feature gaps (see "Missing features" above), I don't think we can suggest that there be only one highlighter. I admit I see that as potential eventuality that I think is desirable, but it's a moot discussion right now. That being said, the UH, being based on the PH, does everything it does and more. It scores/ranks and formats using the same code. The very kernel of the highlighter that produces the Passages[] (now in FieldHighlighter.highlightOffsetsEnums) is essentially the same. Still, I don't think we should do any removing of highlighters at this time. Eventually, we can ask ourselves, what is highlighter XYZ giving us over the UnifiedHighlighter? And then we can see if we (and other users) think it's worth keeping it.

          RE PostingsHighlighter perf trade-offs:

          Yeah I know it's possible to craft an extreme case that would exercise the PostingsEnum reuse – loads of terms in the query and an optimized index. Once we have some benchmarking, we can see how much of a hit was lost by not re-using. That feature was retained in the UH for many months until just recently when it underwent a large refactor to simplify things. Other than this, I don't believe there are any tricks in the PH that we removed in the UH.

          RE ranking/scoring "needs":

          I'm not aware that the UH might have different passing scoring "needs" than the PH. The PH's algorithm seems really nice to me; I didn't put any thought into this aspect. But yeah maybe there might be improvements for phrase/span queries in particular. By the way, PhraseHelper, simply filters out certain occurrences of certain terms. Perhaps the frequency of the span might be used in scoring? But to know that, you must iterate them, and then you lose lazy iteration. Perhaps someone wanting to trade-off performance for possibly better passage relevance would make this trade-off? We/BLAW have no plans to do that. If someone comes along with such a requirement, I hope we can accommodate that interesting direction.

          RE moving / renaming / visibility

          If you have specific suggestions (e.g. w.r.t. MultiTermHighlighting) on how they might be renamed and re-shuffled to different packages than I'd love to hear your thoughts on that. Some things of the UH are expressly public because we/BLAW are using those endpoints but we/BLAW don't use MultiTermHighlighting at this time. But I could imagine some custom wildcard query coming into existence and it would be a PITA if we couldn't help MTH understand some new query. Similar for WSTE.

          BTW there is a visibility test expressly for ensuring certain things are public.

          Show
          dsmiley David Smiley added a comment - Thanks for chiming in Rob. For the present while there are feature gaps (see "Missing features" above), I don't think we can suggest that there be only one highlighter. I admit I see that as potential eventuality that I think is desirable, but it's a moot discussion right now. That being said, the UH, being based on the PH, does everything it does and more. It scores/ranks and formats using the same code. The very kernel of the highlighter that produces the Passages[] (now in FieldHighlighter.highlightOffsetsEnums) is essentially the same. Still, I don't think we should do any removing of highlighters at this time. Eventually, we can ask ourselves, what is highlighter XYZ giving us over the UnifiedHighlighter? And then we can see if we (and other users) think it's worth keeping it. RE PostingsHighlighter perf trade-offs: Yeah I know it's possible to craft an extreme case that would exercise the PostingsEnum reuse – loads of terms in the query and an optimized index. Once we have some benchmarking, we can see how much of a hit was lost by not re-using. That feature was retained in the UH for many months until just recently when it underwent a large refactor to simplify things. Other than this, I don't believe there are any tricks in the PH that we removed in the UH. RE ranking/scoring "needs": I'm not aware that the UH might have different passing scoring "needs" than the PH. The PH's algorithm seems really nice to me; I didn't put any thought into this aspect. But yeah maybe there might be improvements for phrase/span queries in particular. By the way, PhraseHelper, simply filters out certain occurrences of certain terms. Perhaps the frequency of the span might be used in scoring? But to know that, you must iterate them, and then you lose lazy iteration. Perhaps someone wanting to trade-off performance for possibly better passage relevance would make this trade-off? We/BLAW have no plans to do that. If someone comes along with such a requirement, I hope we can accommodate that interesting direction. RE moving / renaming / visibility If you have specific suggestions (e.g. w.r.t. MultiTermHighlighting) on how they might be renamed and re-shuffled to different packages than I'd love to hear your thoughts on that. Some things of the UH are expressly public because we/BLAW are using those endpoints but we/BLAW don't use MultiTermHighlighting at this time. But I could imagine some custom wildcard query coming into existence and it would be a PITA if we couldn't help MTH understand some new query. Similar for WSTE. BTW there is a visibility test expressly for ensuring certain things are public.
          Hide
          Timothy055 Timothy M. Rodriguez added a comment -

          Actually, I think passage relevancy might be something we'd look into in more details down the line. Definitely, some of the things in LUCENE-4909 could be useful. I see merit in keeping things separate to allow for flexibility.

          Show
          Timothy055 Timothy M. Rodriguez added a comment - Actually, I think passage relevancy might be something we'd look into in more details down the line. Definitely, some of the things in LUCENE-4909 could be useful. I see merit in keeping things separate to allow for flexibility.
          Hide
          dsmiley David Smiley added a comment -

          I developed a benchmark using Lucene's benchmark module; it's attached as a patch. I made some changes to some existing classes there and it's debatable if those changes are readily committable. The benchmark is on 200k documents from the wikipedia/enwiki data set. While poking through the data and running some queries through Luke, I developed a few lists of queries: terms, phrases, and wildcards. There are some boolean operators in there, and both phrase and wildcard query lists have some occasional TermQuery clauses intermixed too. I had planned to add another query list but this takes awhile. Due to the differences in index data, I have two similar .alg files, one for full term vectors, and the other for postings. I used the postings one to test analysis as well but it could have been on either. It should be the same document data. Since I have multiple query lists, I did a total of 6 benchmark executions, and each time tweaking the file.query.maker.file param and switching to the other .alg once. In the table below, the first (search) row is the time it takes to search and retrieve the data to highlight but not to actually do any highlighting. It's a baseline. The other numbers are over and above that time. In other words, I subtracted the output from the benchmark for the highlighter modes from the baseline so I could measure highlighting time.

          I tested the standard Highlighter (SH), PostingsHighlighter (PH), FastVectorHighlighter (FVH), and UnifiedHighlighter (UH). The suffix stands for the analysis mode: analysis (A), term vectors (V), postings (P), and postings with light term vectors (PV) – a mode unique to the UH. The code I wrote to test these, where possible, tried to configure them similarly.

          Impl terms phrases wildcards
          (search) 1.08 1.22 1.46
          SH_A 3.92 4.53 9.33
          UH_A 1.91 1.70 3.93
          SH_V 1.83 1.59 3.93
          FVH_V 0.85 1.36 2.40
          UH_V 0.80 1.00 1.94
          PH_P 0.91 0.57 4.02
          UH_P 0.61 0.36 4.03
          UH_PV 0.52 0.35 1.76

          I ranked it by offset mode so you can see things working off the same offset source. Judging from all the runs I did and as I tweaked what was being measured, there seems to be a large % err on these numbers, maybe 15%; I'm not sure. Nevertheless the numbers above seem about right after I have done them a bunch of times and tweaked the benchmark.

          Conclusions: The UH is faster in each offset mode than the others. It is a lot faster in Analysis mode than the standard Highlighter is. In some runs I've also seen the FVH beat out the UH. Note that months ago I ascertained that the FVH is not as sensitive to the performance of an underlying BreakIterator as UH & PH are – so "cheap" BI's like the char separator one make for a UH that handily beats FVH but expensive BI's (like the default JDK provided) make these two more competitive.

          One cool observation that surprised me is the phrase query difference between PH & UH. Despite the accuracy mode of UH (set to true for these benchmarks), it's still faster than PH. I temporarily disabled it and re-ran and found that the UH got slower when it treated them like PH does (bag of terms). I believe that is because the filtering of these terms positions the UH does, while it intrinsically has some cost, seems to be cheaper than the main highlighting loop seeing more occurrences of terms that result in more Passages (which also needs to invoke the BreakIterator). Accuracy & speed – Cool!

          Of course this benchmark could be improved... and it could be modified to measure highlighting shorter text or longer text. And maybe try that case of an optimized index and lots of terms in the query. Maybe benchmark queries with SpanMultiTermQuery in them, or ones with phrases & wildcards. And I was going to measure memory allocation but against this large matrix I changed my mind as I've got other things to get to. I had done so months ago and the results looked great.

          Show
          dsmiley David Smiley added a comment - I developed a benchmark using Lucene's benchmark module; it's attached as a patch. I made some changes to some existing classes there and it's debatable if those changes are readily committable. The benchmark is on 200k documents from the wikipedia/enwiki data set. While poking through the data and running some queries through Luke, I developed a few lists of queries: terms, phrases, and wildcards. There are some boolean operators in there, and both phrase and wildcard query lists have some occasional TermQuery clauses intermixed too. I had planned to add another query list but this takes awhile. Due to the differences in index data, I have two similar .alg files, one for full term vectors, and the other for postings. I used the postings one to test analysis as well but it could have been on either. It should be the same document data. Since I have multiple query lists, I did a total of 6 benchmark executions, and each time tweaking the file.query.maker.file param and switching to the other .alg once. In the table below, the first (search) row is the time it takes to search and retrieve the data to highlight but not to actually do any highlighting. It's a baseline. The other numbers are over and above that time. In other words, I subtracted the output from the benchmark for the highlighter modes from the baseline so I could measure highlighting time. I tested the standard Highlighter (SH), PostingsHighlighter (PH), FastVectorHighlighter (FVH), and UnifiedHighlighter (UH). The suffix stands for the analysis mode: analysis (A), term vectors (V), postings (P), and postings with light term vectors (PV) – a mode unique to the UH. The code I wrote to test these, where possible, tried to configure them similarly. Impl terms phrases wildcards (search) 1.08 1.22 1.46 SH_A 3.92 4.53 9.33 UH_A 1.91 1.70 3.93 SH_V 1.83 1.59 3.93 FVH_V 0.85 1.36 2.40 UH_V 0.80 1.00 1.94 PH_P 0.91 0.57 4.02 UH_P 0.61 0.36 4.03 UH_PV 0.52 0.35 1.76 I ranked it by offset mode so you can see things working off the same offset source. Judging from all the runs I did and as I tweaked what was being measured, there seems to be a large % err on these numbers, maybe 15%; I'm not sure. Nevertheless the numbers above seem about right after I have done them a bunch of times and tweaked the benchmark. Conclusions: The UH is faster in each offset mode than the others. It is a lot faster in Analysis mode than the standard Highlighter is. In some runs I've also seen the FVH beat out the UH. Note that months ago I ascertained that the FVH is not as sensitive to the performance of an underlying BreakIterator as UH & PH are – so "cheap" BI's like the char separator one make for a UH that handily beats FVH but expensive BI's (like the default JDK provided) make these two more competitive. One cool observation that surprised me is the phrase query difference between PH & UH. Despite the accuracy mode of UH (set to true for these benchmarks), it's still faster than PH. I temporarily disabled it and re-ran and found that the UH got slower when it treated them like PH does (bag of terms). I believe that is because the filtering of these terms positions the UH does, while it intrinsically has some cost, seems to be cheaper than the main highlighting loop seeing more occurrences of terms that result in more Passages (which also needs to invoke the BreakIterator). Accuracy & speed – Cool! Of course this benchmark could be improved... and it could be modified to measure highlighting shorter text or longer text. And maybe try that case of an optimized index and lots of terms in the query. Maybe benchmark queries with SpanMultiTermQuery in them, or ones with phrases & wildcards. And I was going to measure memory allocation but against this large matrix I changed my mind as I've got other things to get to. I had done so months ago and the results looked great.
          Hide
          dsmiley David Smiley added a comment -

          BTW just to re-inforce the wide precision in the numbers, note that UH_P & UH_PV should theoretically highlight term queries with identical code (PV gets internally optimized to P when there are no wildcards) yet it differed 0.61 to 0.52 (17%).

          Quoting myself earlier:

          For the surface classes users use: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter. – I don't think it good to have users use parts of another highlighter (postingshighlight), which is weird for users. I propose copying these with a leading 'U', i.e. UPassage etc. That said if others think that's a worse trade-off, it's no big deal to me. Once o.a.l.s.ph.Passage's constructor is public, it's possible to do that.

          Reconsidering... I think it's fine to use these classes from the PostingsHighlighter... after all, these classes are only actually seen by users when they want to customize them. A user can simply call public methods on UH and reference zero other classes (getting back strings based on default impls of those classes). If there are no objections to this path, Tim/I can update the PR along with the other changes listed in the same comment I just referenced.

          BTW this highlighter has been in production for about 6 months. It rooted out a couple bugs.

          Show
          dsmiley David Smiley added a comment - BTW just to re-inforce the wide precision in the numbers, note that UH_P & UH_PV should theoretically highlight term queries with identical code (PV gets internally optimized to P when there are no wildcards) yet it differed 0.61 to 0.52 (17%). Quoting myself earlier: For the surface classes users use: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter. – I don't think it good to have users use parts of another highlighter (postingshighlight), which is weird for users. I propose copying these with a leading 'U', i.e. UPassage etc. That said if others think that's a worse trade-off, it's no big deal to me. Once o.a.l.s.ph.Passage's constructor is public, it's possible to do that. Reconsidering... I think it's fine to use these classes from the PostingsHighlighter... after all, these classes are only actually seen by users when they want to customize them. A user can simply call public methods on UH and reference zero other classes (getting back strings based on default impls of those classes). If there are no objections to this path, Tim/I can update the PR along with the other changes listed in the same comment I just referenced. BTW this highlighter has been in production for about 6 months. It rooted out a couple bugs.
          Hide
          rpedela Ryan Pedela added a comment - - edited

          I am very happy to see this. I use Elasticsearch, and I currently use the experimental highlighter plugin for three reasons.

          1. It uses either term vectors or postings to increase performance.
          2. It has fragment and sentence modes.
          3. The sentence mode produces significantly better highlights than the postings highlighter in my experience.

          I would prefer to use an official highlighter and happy to see that the UnifiedHighlighter will take care of #1 and #2. Now I would like to talk about #3.

          I don't know the specifics of the algorithm, but the experimental highlighter appears to take proximity and a keyword's document position into account. One example from memory, I had a medical research paper about warfarin and the highlight returned by the postings highlighter for the search "warfarin" came from the references. However the experimental highlighter returned a highlight near the beginning of the paper and it was a pretty good summary of the paper.

          There is also room for improvement for both the experimental and postings highlighters. They both appear to use the same sentence fragmenter which does not do a good job with abbreviations and decimal points. Would something like Stanford CoreNLP help?

          Also I would like a highlighter that tries to get as many keywords as possible into the highlight, at least as a config option. That is hard if only returning a single sentence or fragment. However I often want three fragments and I would like the union of the three fragments to contain all the keywords or as many as possible. For example, I am working on a search engine for SEC filings and a user searched "BPL hedgings" during a user test. BPL is the stock ticker for Buckeye Partners, and stock tickers are pretty unique within the SEC filings. The experimental highlighter returned three fragments with "BPL" but no "hedgings" (fast vector highlighter produced similar fragments). The user was very confused because they didn't see the word "hedgings" in the highlight and thought it wasn't found even though it was. To fix this, I retrieve the top 100 fragments and post-process them to find the best 3 fragments which contain the most keywords collectively. The post processing is quite naive since it does not understand proximity, stemming, etc. I would prefer if Lucene or ES did it because it can be much smarter.

          Show
          rpedela Ryan Pedela added a comment - - edited I am very happy to see this. I use Elasticsearch, and I currently use the experimental highlighter plugin for three reasons. 1. It uses either term vectors or postings to increase performance. 2. It has fragment and sentence modes. 3. The sentence mode produces significantly better highlights than the postings highlighter in my experience. I would prefer to use an official highlighter and happy to see that the UnifiedHighlighter will take care of #1 and #2. Now I would like to talk about #3. I don't know the specifics of the algorithm, but the experimental highlighter appears to take proximity and a keyword's document position into account. One example from memory, I had a medical research paper about warfarin and the highlight returned by the postings highlighter for the search "warfarin" came from the references. However the experimental highlighter returned a highlight near the beginning of the paper and it was a pretty good summary of the paper. There is also room for improvement for both the experimental and postings highlighters. They both appear to use the same sentence fragmenter which does not do a good job with abbreviations and decimal points. Would something like Stanford CoreNLP help? Also I would like a highlighter that tries to get as many keywords as possible into the highlight, at least as a config option. That is hard if only returning a single sentence or fragment. However I often want three fragments and I would like the union of the three fragments to contain all the keywords or as many as possible. For example, I am working on a search engine for SEC filings and a user searched "BPL hedgings" during a user test. BPL is the stock ticker for Buckeye Partners, and stock tickers are pretty unique within the SEC filings. The experimental highlighter returned three fragments with "BPL" but no "hedgings" (fast vector highlighter produced similar fragments). The user was very confused because they didn't see the word "hedgings" in the highlight and thought it wasn't found even though it was. To fix this, I retrieve the top 100 fragments and post-process them to find the best 3 fragments which contain the most keywords collectively. The post processing is quite naive since it does not understand proximity, stemming, etc. I would prefer if Lucene or ES did it because it can be much smarter.
          Hide
          dsmiley David Smiley added a comment -

          Thanks for your input Ryan! I have not heard of Wikimedia's "experimental highlighter"; maybe I'll go poke around there to see what it's doing.

          RE snippet delineation: This is customizable in the PH & UH & FVH via supplying a java.text.BreakIterator.

          RE snippets with multiple keywords: Yes! I've had that thought as well and some months ago I filed a TODO in my wish list of highlighter features to add a "coordination factor" to the UH algorithm (which at the moment is identical to the PH). A simple coord factor would help a lot, but even better would be one that also considered term diversity across the multiple passages you ask for rather than scoring each separately. To do this, it might internally track it's the leading candidate passage per term, in addition to the PriorityQueue it has now. This would be very low overhead.

          Show
          dsmiley David Smiley added a comment - Thanks for your input Ryan! I have not heard of Wikimedia's "experimental highlighter"; maybe I'll go poke around there to see what it's doing. RE snippet delineation: This is customizable in the PH & UH & FVH via supplying a java.text.BreakIterator. RE snippets with multiple keywords: Yes! I've had that thought as well and some months ago I filed a TODO in my wish list of highlighter features to add a "coordination factor" to the UH algorithm (which at the moment is identical to the PH). A simple coord factor would help a lot, but even better would be one that also considered term diversity across the multiple passages you ask for rather than scoring each separately. To do this, it might internally track it's the leading candidate passage per term, in addition to the PriorityQueue it has now. This would be very low overhead.
          Hide
          dsmiley David Smiley added a comment -

          BTW I reviewed the W.E.H. and posted a comparison: https://github.com/wikimedia/search-highlighter/issues/19

          Show
          dsmiley David Smiley added a comment - BTW I reviewed the W.E.H. and posted a comparison: https://github.com/wikimedia/search-highlighter/issues/19
          Hide
          Timothy055 Timothy M. Rodriguez added a comment -

          After further consideration, it seems best to leave some of the classes common between Postings and the Unified highlighters separate. If we were to use the same classes they'd ideally move to a common sub-package that both could share and this would introduce unneeded change and hurt potential compatibility for any users of those classes. Keeping them separate also allows for a possible improvement to the method highlightFieldsAsObjects which internally creates a Map that is promptly thrown away again in the highlight methods. I briefly investigated changing this to return the internal Object[][] array and avoid the extra Map allocation, but this creates some awkwardness since the Object[][] array sorts the input fields before filling the arrays, which would make the API somewhat of a trap for callers. This undesired behavior is likely why the map is being created. One way to fix this is to generify PassageFormatter over it's output type which would allow for a PassageFormatter<String> in the case of the DefaultPassageFormatter. However, changing this is a rather involved change that could ultimately result in the UnifiedHighlighter itself having a generic type and it was not clear that muddying the waters with that right now was a good idea. However, keeping these classes separate will allow for an attempt at that in the future.

          In the meantime, I've also pushed a commit to reduce the visibility of the MultiTermHighlighting to package protected. As it stands, I think this patch is ready.

          Show
          Timothy055 Timothy M. Rodriguez added a comment - After further consideration, it seems best to leave some of the classes common between Postings and the Unified highlighters separate. If we were to use the same classes they'd ideally move to a common sub-package that both could share and this would introduce unneeded change and hurt potential compatibility for any users of those classes. Keeping them separate also allows for a possible improvement to the method highlightFieldsAsObjects which internally creates a Map that is promptly thrown away again in the highlight methods. I briefly investigated changing this to return the internal Object[][] array and avoid the extra Map allocation, but this creates some awkwardness since the Object[][] array sorts the input fields before filling the arrays, which would make the API somewhat of a trap for callers. This undesired behavior is likely why the map is being created. One way to fix this is to generify PassageFormatter over it's output type which would allow for a PassageFormatter<String> in the case of the DefaultPassageFormatter. However, changing this is a rather involved change that could ultimately result in the UnifiedHighlighter itself having a generic type and it was not clear that muddying the waters with that right now was a good idea. However, keeping these classes separate will allow for an attempt at that in the future. In the meantime, I've also pushed a commit to reduce the visibility of the MultiTermHighlighting to package protected. As it stands, I think this patch is ready.
          Hide
          dsmiley David Smiley added a comment -

          (I'm attaching the patch)
          All new files; no changes to anything existing.

          I plan to commit Tuesday to give even more time for review.

          I'd also like to commit the patch for the benchmark module (but without the query files polluting the file listing?). However I think for it to be okay, it needs to go further and remove the way highlighters were benchmarked before this, since it's too hacky/weird to see both, particularly since the existing mechanism has hooks into ReadTask (getBenchmarkHighlighter()). I figure the entire benchmark module can change at our will without back-compat concern.

          While looking at the FVH and WEH I noticed a feature in which term vecs from multiple fields can be used to highlight one field – useful when you analyze the text in different ways into different fields (e.g. stemming vs not). We're actually doing that with the UH in Bloomberg (offset source agnostic of course) but I didn't think to add it as a first-class feature to the UH. Now I think we should in a follow-up issue. I think that requirement is causing us to want things like StrictPhraseHelper to be public but it could be moved to package protected then, I think.

          Show
          dsmiley David Smiley added a comment - (I'm attaching the patch) All new files; no changes to anything existing. I plan to commit Tuesday to give even more time for review. I'd also like to commit the patch for the benchmark module (but without the query files polluting the file listing?). However I think for it to be okay, it needs to go further and remove the way highlighters were benchmarked before this, since it's too hacky/weird to see both, particularly since the existing mechanism has hooks into ReadTask (getBenchmarkHighlighter()). I figure the entire benchmark module can change at our will without back-compat concern. While looking at the FVH and WEH I noticed a feature in which term vecs from multiple fields can be used to highlight one field – useful when you analyze the text in different ways into different fields (e.g. stemming vs not). We're actually doing that with the UH in Bloomberg (offset source agnostic of course) but I didn't think to add it as a first-class feature to the UH. Now I think we should in a follow-up issue. I think that requirement is causing us to want things like StrictPhraseHelper to be public but it could be moved to package protected then, I think.
          Hide
          dsmiley David Smiley added a comment -

          Attached is a small update to the UH – the patch will apply on top of the main patch.

          • fixed ant precommit issue – just TestUnifiedHighlighterExtensibility was affected
          • TestUnifiedHighlighterExtensibility was actually referring to some methods that should not be tested for extensibility. I think Tim forgot to remove them as we already discussed it.
          • Moved some logic from UH.getFieldHighlighter into UH.getOffsetStrategy which I think makes sense since that setup was only applicable to getOffsetStrategy, and furthermore it paves the way to making a multi-field offset strategy more obvious (to be done in a follow-up issue, which I'm looking forward to). I adjusted the method declaration order to read top-down.
          Show
          dsmiley David Smiley added a comment - Attached is a small update to the UH – the patch will apply on top of the main patch. fixed ant precommit issue – just TestUnifiedHighlighterExtensibility was affected TestUnifiedHighlighterExtensibility was actually referring to some methods that should not be tested for extensibility. I think Tim forgot to remove them as we already discussed it. Moved some logic from UH.getFieldHighlighter into UH.getOffsetStrategy which I think makes sense since that setup was only applicable to getOffsetStrategy, and furthermore it paves the way to making a multi-field offset strategy more obvious (to be done in a follow-up issue, which I'm looking forward to). I adjusted the method declaration order to read top-down.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 722e82712435ecf46c9868137d885484152f749b in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=722e827 ]

          LUCENE-7438: New UnifiedHighlighter

          Show
          jira-bot ASF subversion and git services added a comment - Commit 722e82712435ecf46c9868137d885484152f749b in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=722e827 ] LUCENE-7438 : New UnifiedHighlighter
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 4b6794368df373df1f68ccf27f7556914efeb95e in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b67943 ]

          LUCENE-7438: New UnifiedHighlighter

          (cherry picked from commit 722e827)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 4b6794368df373df1f68ccf27f7556914efeb95e in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b67943 ] LUCENE-7438 : New UnifiedHighlighter (cherry picked from commit 722e827)
          Hide
          dsmiley David Smiley added a comment -

          This is an update to the benchmark patch. I removed the existing benchmark highlighting abstraction (that I felt was a bit obsolete), and with it the existing two highlighting benchmark classes: SearchTravRetHighlightTask, SearchTravRetVectorHighlightTask. The patch actually replaces SearchTravRetHighlightTask with the one from the previous patch, and so by class name it still exists, but is internally very different as it tests all highlighters in all offset modes. It has the 2 highlighters-*.alg added in the last patch, and I kept the 3 query-*.txt files too. I removed the existing highlight .alg files except for one which I updated – standard-highlights-notv.alg -> highlights.alg. I also added a "UH" highlight mode to the benchmark, which is the UH's default mode operation in which it detects the offset source based on FieldInfo.

          I tweaked the build.xml & .gitignore to avoid work/ and temp/ and to allow them to be symbolic links.

          The only thing I feel bad about was outright removing some tests related to the old highlight abstraction... meanwhile there are no new tests for this new one. I rationalize this as it's better to finally have a more up-to-date way to highlight all highlighters in all modes (and in a consistent way) than it is to have something incomplete that is nevertheless tested.

          I'll commit this in a couple days.

          Show
          dsmiley David Smiley added a comment - This is an update to the benchmark patch. I removed the existing benchmark highlighting abstraction (that I felt was a bit obsolete), and with it the existing two highlighting benchmark classes: SearchTravRetHighlightTask, SearchTravRetVectorHighlightTask. The patch actually replaces SearchTravRetHighlightTask with the one from the previous patch, and so by class name it still exists, but is internally very different as it tests all highlighters in all offset modes. It has the 2 highlighters-*.alg added in the last patch, and I kept the 3 query-*.txt files too. I removed the existing highlight .alg files except for one which I updated – standard-highlights-notv.alg -> highlights.alg. I also added a "UH" highlight mode to the benchmark, which is the UH's default mode operation in which it detects the offset source based on FieldInfo. I tweaked the build.xml & .gitignore to avoid work/ and temp/ and to allow them to be symbolic links. The only thing I feel bad about was outright removing some tests related to the old highlight abstraction... meanwhile there are no new tests for this new one. I rationalize this as it's better to finally have a more up-to-date way to highlight all highlighters in all modes (and in a consistent way) than it is to have something incomplete that is nevertheless tested. I'll commit this in a couple days.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 5ef60af9c18624a317d5f3e8e331b2bb83c569db in lucene-solr's branch refs/heads/master from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ef60af ]

          LUCENE-7438: Renovate benchmark module's support for highlighting

          Show
          jira-bot ASF subversion and git services added a comment - Commit 5ef60af9c18624a317d5f3e8e331b2bb83c569db in lucene-solr's branch refs/heads/master from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5ef60af ] LUCENE-7438 : Renovate benchmark module's support for highlighting
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 3497a2902c198b8092b4b0352650e58543b296b5 in lucene-solr's branch refs/heads/branch_6x from David Smiley
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3497a29 ]

          LUCENE-7438: Renovate benchmark module's support for highlighting

          (cherry picked from commit 5ef60af)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 3497a2902c198b8092b4b0352650e58543b296b5 in lucene-solr's branch refs/heads/branch_6x from David Smiley [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3497a29 ] LUCENE-7438 : Renovate benchmark module's support for highlighting (cherry picked from commit 5ef60af)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 0414570348e5b8c8c3da8b1ad491b1b418a3756a in lucene-solr's branch refs/heads/master from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0414570 ]

          LUCENE-7438: fix broken build

          Show
          jira-bot ASF subversion and git services added a comment - Commit 0414570348e5b8c8c3da8b1ad491b1b418a3756a in lucene-solr's branch refs/heads/master from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0414570 ] LUCENE-7438 : fix broken build
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit f247acfab17a5363166eb601cb26243e74ca108c in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f247acf ]

          LUCENE-7438: fix broken build

          Show
          jira-bot ASF subversion and git services added a comment - Commit f247acfab17a5363166eb601cb26243e74ca108c in lucene-solr's branch refs/heads/branch_6x from Mike McCandless [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f247acf ] LUCENE-7438 : fix broken build
          Hide
          dsmiley David Smiley added a comment -

          Thanks for fixing the build Mike! My bad.

          Show
          dsmiley David Smiley added a comment - Thanks for fixing the build Mike! My bad.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user Timothy055 closed the pull request at:

          https://github.com/apache/lucene-solr/pull/79

          Show
          githubbot ASF GitHub Bot added a comment - Github user Timothy055 closed the pull request at: https://github.com/apache/lucene-solr/pull/79
          Hide
          yseeley@gmail.com Yonik Seeley added a comment -

          Is this usable from Solr yet, or should a new issue be opened for that?
          A quick grep suggests there's nothing yet:

          find solr -type f | xargs grep -i UnifiedHighlighter
          
          Show
          yseeley@gmail.com Yonik Seeley added a comment - Is this usable from Solr yet, or should a new issue be opened for that? A quick grep suggests there's nothing yet: find solr -type f | xargs grep -i UnifiedHighlighter
          Hide
          Timothy055 Timothy M. Rodriguez added a comment -

          Not yet, we have an initial general implementation, but it's lacking tests. (We have a customized extension internally that does have tests.) I've created a new ticket https://issues.apache.org/jira/browse/SOLR-9708 with a PR containing the initial impl so folks can follow or help the work towards finishing it up. Thanks for asking though, hopefully this gets the ball rolling faster.

          Show
          Timothy055 Timothy M. Rodriguez added a comment - Not yet, we have an initial general implementation, but it's lacking tests. (We have a customized extension internally that does have tests.) I've created a new ticket https://issues.apache.org/jira/browse/SOLR-9708 with a PR containing the initial impl so folks can follow or help the work towards finishing it up. Thanks for asking though, hopefully this gets the ball rolling faster.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.3.0 release.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.3.0 release.

            People

            • Assignee:
              dsmiley David Smiley
              Reporter:
              Timothy055 Timothy M. Rodriguez
            • Votes:
              8 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development