Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.2, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Add support for CommonTermsQuery to all highlighter impls.
      This might add a dependency (query-jar) to the highlighter so we might think about adding it to core?

      1. LUCENE-4728.patch
        28 kB
        Simon Willnauer
      2. LUCENE-4728.patch
        24 kB
        Simon Willnauer
      3. LUCENE-4728.patch
        13 kB
        Simon Willnauer
      4. LUCENE-4728.patch
        11 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          patch adding support and tests

          Show
          Simon Willnauer added a comment - patch adding support and tests
          Hide
          Robert Muir added a comment -

          In the case of these highlighters with huge instanceof's across all queries, instead of doing things like:

          +    } else if (sourceQuery instanceof CommonTermsQuery) {
          +      flatten(sourceQuery.rewrite(reader), reader, flatQueries);
               }
          

          can't it just be something like:

          +    } else if (!sourceQuery instanceof MultiTermQuery) {
          +      // an unknown query type, see if it rewrites into something we know.
          +      flatten(sourceQuery.rewrite(reader), reader, flatQueries);
               }
          

          then we wouldn't need explicit dependencies in these highlighters on every single query

          Show
          Robert Muir added a comment - In the case of these highlighters with huge instanceof's across all queries, instead of doing things like: + } else if (sourceQuery instanceof CommonTermsQuery) { + flatten(sourceQuery.rewrite(reader), reader, flatQueries); } can't it just be something like: + } else if (!sourceQuery instanceof MultiTermQuery) { + // an unknown query type, see if it rewrites into something we know. + flatten(sourceQuery.rewrite(reader), reader, flatQueries); } then we wouldn't need explicit dependencies in these highlighters on every single query
          Hide
          Simon Willnauer added a comment -

          the reason why I didn't do that is that this query simply stops flattening if it doesn't know the query so if we turn this around we need to make sure that we don't end up in an endless loop rewriting all the time. I mean we can do this and add another check but I would feel safer to have these explicit checks to be honest. I don't think this is a bottle neck of any sort so while I don't really like it I think its ok though.

          Show
          Simon Willnauer added a comment - the reason why I didn't do that is that this query simply stops flattening if it doesn't know the query so if we turn this around we need to make sure that we don't end up in an endless loop rewriting all the time. I mean we can do this and add another check but I would feel safer to have these explicit checks to be honest. I don't think this is a bottle neck of any sort so while I don't really like it I think its ok though.
          Hide
          Robert Muir added a comment -

          the reason why I didn't do that is that this query simply stops flattening if it doesn't know the query so if we turn this around we need to make sure that we don't end up in an endless loop rewriting all the time.

          I don't see the problem, we only need to rewrite it until it 'stops', just like indexsearcher does:

              Query query = original;
              for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query;
                   rewrittenQuery = query.rewrite(reader)) {
                query = rewrittenQuery;
              }
          

          I don't think this is a bottle neck of any sort

          I do think its a bottleneck, in that if we only did this simple thing, then highlighters wouldnt have to have hard dependencies on every concrete query we add. They would also work with custom queries that users make that rewrite to graphs of core queries (this one is a perfect example!)

          There isn't a need for the highlighting module to depend on the queries module when it can do this with the abstract API already in core.

          Show
          Robert Muir added a comment - the reason why I didn't do that is that this query simply stops flattening if it doesn't know the query so if we turn this around we need to make sure that we don't end up in an endless loop rewriting all the time. I don't see the problem, we only need to rewrite it until it 'stops', just like indexsearcher does: Query query = original; for (Query rewrittenQuery = query.rewrite(reader); rewrittenQuery != query; rewrittenQuery = query.rewrite(reader)) { query = rewrittenQuery; } I don't think this is a bottle neck of any sort I do think its a bottleneck, in that if we only did this simple thing, then highlighters wouldnt have to have hard dependencies on every concrete query we add. They would also work with custom queries that users make that rewrite to graphs of core queries (this one is a perfect example!) There isn't a need for the highlighting module to depend on the queries module when it can do this with the abstract API already in core.
          Hide
          Simon Willnauer added a comment -

          new patch - I added a comment there to make sure we don't run into stack overflow exceptions as well as a test for that!

          Show
          Simon Willnauer added a comment - new patch - I added a comment there to make sure we don't run into stack overflow exceptions as well as a test for that!
          Hide
          Robert Muir added a comment -

          nice: I like it! Is it possible to fix WeightedSpanTermExtractor the same way for the other highlighter?

          it seems like the default if it doesnt know the query is... to do nothing?

          Show
          Robert Muir added a comment - nice: I like it! Is it possible to fix WeightedSpanTermExtractor the same way for the other highlighter? it seems like the default if it doesnt know the query is... to do nothing?
          Hide
          Simon Willnauer added a comment -

          nice: I like it! Is it possible to fix WeightedSpanTermExtractor the same way for the other highlighter?

          this one is tricky since it involves getting the field from the query in order to fetch a reader. This involves instanceof checks again which doesn't buy use much. Fix positions on Query / Scorer would help you know I think this is ready, I will commit that soon.

          Show
          Simon Willnauer added a comment - nice: I like it! Is it possible to fix WeightedSpanTermExtractor the same way for the other highlighter? this one is tricky since it involves getting the field from the query in order to fetch a reader. This involves instanceof checks again which doesn't buy use much. Fix positions on Query / Scorer would help you know I think this is ready, I will commit that soon.
          Hide
          Robert Muir added a comment -

          wait i dont get it... its another instanceof block.

          I think we would just do the same trick in WeightedSpanTermExtractor, and then in the test have a mock query that rewrites to primitives (e.g. boolean query) and not have highlighter depend on the queries module?

          I dont think highlighters should depend on the concrete queries, only the abstract apis (just like i think modules shouldnt depend on analyzers modules)... otherwise its a sign something is wrong.

          Show
          Robert Muir added a comment - wait i dont get it... its another instanceof block. I think we would just do the same trick in WeightedSpanTermExtractor, and then in the test have a mock query that rewrites to primitives (e.g. boolean query) and not have highlighter depend on the queries module? I dont think highlighters should depend on the concrete queries, only the abstract apis (just like i think modules shouldnt depend on analyzers modules)... otherwise its a sign something is wrong.
          Hide
          Simon Willnauer added a comment -

          wait i dont get it... its another instanceof block.

          no not necessarily. Since for rewrite I need to get the field of the query otherwise I can't get a IndexReader in WeightedSpanTermExtractor. The other problem here is that WeightedSpanTermExtractor doesn't rewrite against a global reader but rather against a "reanalyzed" reader which might bring up problems in the case of CommonTermsQuery which will in-turn create a different BooleanQuery.

          I dont think highlighters should depend on the concrete queries, only the abstract apis (just like i think modules shouldnt depend on analyzers modules)... otherwise its a sign something is wrong.

          dude this is wishful thinking unless we fix our API to allow to do positional queries. Really we already rely on it with ConstantScore / FilteredQuery calling getQuery() we also rely on BQ etc. and TermQuery which is not abstract api.

          Show
          Simon Willnauer added a comment - wait i dont get it... its another instanceof block. no not necessarily. Since for rewrite I need to get the field of the query otherwise I can't get a IndexReader in WeightedSpanTermExtractor. The other problem here is that WeightedSpanTermExtractor doesn't rewrite against a global reader but rather against a "reanalyzed" reader which might bring up problems in the case of CommonTermsQuery which will in-turn create a different BooleanQuery. I dont think highlighters should depend on the concrete queries, only the abstract apis (just like i think modules shouldnt depend on analyzers modules)... otherwise its a sign something is wrong. dude this is wishful thinking unless we fix our API to allow to do positional queries. Really we already rely on it with ConstantScore / FilteredQuery calling getQuery() we also rely on BQ etc. and TermQuery which is not abstract api.
          Hide
          Uwe Schindler added a comment -

          Hi,
          I had the same problem while implementing a custom query for a customer. The query was very easy, it just rewrote after expanding terms to MultiPhraseQuery - you would expect that this works with highlighter! - But it doen't. The problem is that highligther does not even try to rewrite the query, it only checks via instanceof checks the original query type, failing to highlight my simple query without custom weights and scorers, just a very simple rewrite method. That is not a good design! If the highlighter would rewrite the query as a last chance this problem would have been solved. The problem with that is a second one in the crazy Lucene Highlighter: You need the field name for highlighter to work

          For this customer my only chance was to use Javassist to hot-patch the WeightedSpanTermExtractor and add another instanceof check. Overriding the fallback to handle other queries was impossible because the customer's framework was ElasticSearch which has a highly private, unextendable WeightedSpanTermExtractor with no possibility to override the Lucene default [same applies for Solr]

          This brings us back to a very old issue: We should extend the Query class by a simple additional API, so it can provide all metadata needed to do highlighting without crazy instaceof chains.

          Show
          Uwe Schindler added a comment - Hi, I had the same problem while implementing a custom query for a customer. The query was very easy, it just rewrote after expanding terms to MultiPhraseQuery - you would expect that this works with highlighter! - But it doen't. The problem is that highligther does not even try to rewrite the query, it only checks via instanceof checks the original query type, failing to highlight my simple query without custom weights and scorers, just a very simple rewrite method. That is not a good design! If the highlighter would rewrite the query as a last chance this problem would have been solved. The problem with that is a second one in the crazy Lucene Highlighter: You need the field name for highlighter to work For this customer my only chance was to use Javassist to hot-patch the WeightedSpanTermExtractor and add another instanceof check. Overriding the fallback to handle other queries was impossible because the customer's framework was ElasticSearch which has a highly private, unextendable WeightedSpanTermExtractor with no possibility to override the Lucene default [same applies for Solr] This brings us back to a very old issue: We should extend the Query class by a simple additional API, so it can provide all metadata needed to do highlighting without crazy instaceof chains.
          Hide
          Robert Muir added a comment -

          private Map<String,AtomicReaderContext> readers = new HashMap<String,AtomicReaderContext>(10);

          Well this is the core problem: per-field readers. Currently the only reasons multitermqueries rewrite is because
          they know the field.

          But isnt this easily solvable? we just wrap a ParallelReader around that.

          Show
          Robert Muir added a comment - private Map<String,AtomicReaderContext> readers = new HashMap<String,AtomicReaderContext>(10); Well this is the core problem: per-field readers. Currently the only reasons multitermqueries rewrite is because they know the field. But isnt this easily solvable? we just wrap a ParallelReader around that.
          Hide
          Uwe Schindler added a comment -

          But isnt this easily solvable? we just wrap a ParallelReader around that.

          +1 thats the solution!

          Show
          Uwe Schindler added a comment - But isnt this easily solvable? we just wrap a ParallelReader around that. +1 thats the solution!
          Hide
          Simon Willnauer added a comment -

          here is the - solves everybodies problem - patch. I added the rewrite trick to both highlighters, remove the field requirement in the WeightedSpanTermExtractor and hacked a reusing AtomicReader that simply delegates to the MemoryIndex AtomicReader shadowing the field. kind of hacky but that entire thing is hacky ey? Nice thing is that we don't need to index the tokenstream twice if we need two different fields.

          Show
          Simon Willnauer added a comment - here is the - solves everybodies problem - patch. I added the rewrite trick to both highlighters, remove the field requirement in the WeightedSpanTermExtractor and hacked a reusing AtomicReader that simply delegates to the MemoryIndex AtomicReader shadowing the field. kind of hacky but that entire thing is hacky ey? Nice thing is that we don't need to index the tokenstream twice if we need two different fields.
          Hide
          Simon Willnauer added a comment -

          any comments on this?

          Show
          Simon Willnauer added a comment - any comments on this?
          Hide
          Simon Willnauer added a comment -

          next iter, updated changes entry and cleaned up WeigthedSpanTermExtractor a bit. I will commit this in a bit.

          Show
          Simon Willnauer added a comment - next iter, updated changes entry and cleaned up WeigthedSpanTermExtractor a bit. I will commit this in a bit.
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1442590

          LUCENE-4728: Unknown and not explicitly mapped queries are now rewritten against the highlighting IndexReader to obtain primitive queries before discarding the query entirely.

          Show
          Commit Tag Bot added a comment - [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1442590 LUCENE-4728 : Unknown and not explicitly mapped queries are now rewritten against the highlighting IndexReader to obtain primitive queries before discarding the query entirely.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1442599

          LUCENE-4728: Unknown and not explicitly mapped queries are now rewritten against the highlighting IndexReader to obtain primitive queries before discarding the query entirely.

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1442599 LUCENE-4728 : Unknown and not explicitly mapped queries are now rewritten against the highlighting IndexReader to obtain primitive queries before discarding the query entirely.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1442606

          LUCENE-4728: add queries to the classpath since highlighter specializes a query now

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1442606 LUCENE-4728 : add queries to the classpath since highlighter specializes a query now
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Simon Willnauer
          http://svn.apache.org/viewvc?view=revision&revision=1442605

          LUCENE-4728: add queries to the classpath since highlighter specializes a query now

          Show
          Commit Tag Bot added a comment - [trunk commit] Simon Willnauer http://svn.apache.org/viewvc?view=revision&revision=1442605 LUCENE-4728 : add queries to the classpath since highlighter specializes a query now
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Steven Rowe
          http://svn.apache.org/viewvc?view=revision&revision=1447142

          LUCENE-4728: IntelliJ configuration: add queries module dependency to highlighter module (merged trunk r1447141)

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1447142 LUCENE-4728 : IntelliJ configuration: add queries module dependency to highlighter module (merged trunk r1447141)
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] Steven Rowe
          http://svn.apache.org/viewvc?view=revision&revision=1447141

          LUCENE-4728: IntelliJ configuration: add queries module dependency to highlighter module

          Show
          Commit Tag Bot added a comment - [trunk commit] Steven Rowe http://svn.apache.org/viewvc?view=revision&revision=1447141 LUCENE-4728 : IntelliJ configuration: add queries module dependency to highlighter module
          Hide
          Uwe Schindler added a comment -

          Closed after release.

          Show
          Uwe Schindler added a comment - Closed after release.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development