Lucene - Core
  1. Lucene - Core
  2. LUCENE-1889

FastVectorHighlighter: support for additional queries

    Details

    • Type: Wish Wish
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I am using fastvectorhighlighter for some strange languages and it is working well!

      One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc)
      Here is one thing Michael M posted in the original ticket:

      I think a nice [eventual] model would be if we could simply re-run the
      scorer on the single document (using InstantiatedIndex maybe, or
      simply some sort of wrapper on the term vectors which are already a
      mini-inverted-index for a single doc), but extend the scorer API to
      tell us the exact term occurrences that participated in a match (which
      I don't think is exposed today).

      Due to strange requirements I am using something similar to this (but specialized to our case).
      I am doing strange things like forcing multitermqueries to rewrite into boolean queries so they will be highlighted,
      and flattening multiphrasequeries into boolean or'ed phrasequeries.
      I do not think these things would be 'fast', but i had a few ideas that might help:

      • looking at contrib/highlighter, you can support FilteredQuery in flatten() by calling getQuery() right?
      • maybe as a last resort, try Query.extractTerms() ?
      1. LUCENE-1889.patch
        18 kB
        Mike Sokolov
      2. LUCENE-1889.patch
        14 kB
        Mike Sokolov
      3. LUCENE-1889.patch
        15 kB
        Mike Sokolov
      4. LUCENE-1889_reader.patch
        2 kB
        Robert Muir
      5. LUCENE-1889-solr.patch
        0.8 kB
        Mike Sokolov

        Issue Links

          Activity

          Hide
          Bill Bell added a comment -

          It looks like this will automatically be used in SOLR ? OR do we need to add support for this ?

          Show
          Bill Bell added a comment - It looks like this will automatically be used in SOLR ? OR do we need to add support for this ?
          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Show
          Uwe Schindler added a comment - Bulk close after release of 3.5
          Hide
          Koji Sekiguchi added a comment -

          The fix of the broken solr build had been committed in last month.

          Show
          Koji Sekiguchi added a comment - The fix of the broken solr build had been committed in last month.
          Hide
          Mike Sokolov added a comment -

          Sorry, forgot to include changes to DefaultSolrHighlighter as well (it gets confusing maintaining multiple patches in the same build).

          I do think the non-reader method should be derprecated as in Robert's comment.

          Show
          Mike Sokolov added a comment - Sorry, forgot to include changes to DefaultSolrHighlighter as well (it gets confusing maintaining multiple patches in the same build). I do think the non-reader method should be derprecated as in Robert's comment.
          Hide
          Robert Muir added a comment -

          here is the patch I applied, might not be the best or whatever, and see the TODO/note in the code.

          Show
          Robert Muir added a comment - here is the patch I applied, might not be the best or whatever, and see the TODO/note in the code.
          Hide
          Chris Male added a comment -

          Solr's build is broken by these changes it seems.

          Show
          Chris Male added a comment - Solr's build is broken by these changes it seems.
          Hide
          Koji Sekiguchi added a comment -

          Thanks Mike!

          Show
          Koji Sekiguchi added a comment - Thanks Mike!
          Hide
          Koji Sekiguchi added a comment -

          Ok, I took the latter way in this issue. committed 3x.

          Show
          Koji Sekiguchi added a comment - Ok, I took the latter way in this issue. committed 3x.
          Hide
          Mike Sokolov added a comment -

          I'm a little out of my depth here, but it appears from poking around in svn history that the field name got pulled up in to MTQ from its subclasses when the "flex_1458" branch was merged in, and I guess that hasn't happened in 3.x?

          So it looks to me that the only way to get this change into 3.x is either to do the same kind of refactoring that was done in that branch, or to go back to special-casing all the different MTQs. (ie something like:

          else if (query instanceof WildcardQuery) {
            return ((WildcardQuery)query).getField();
          }
          else if (query instanceof PrefixQuery) {
            return ((PrefixQuery)query).getField();
          }
          etc...
          

          sorry - I didn't realize the patch relied on something only in trunk. That branch was merged in a year and a half ago now, but I guess it must include enough major changes that it's difficult to back-port.

          Show
          Mike Sokolov added a comment - I'm a little out of my depth here, but it appears from poking around in svn history that the field name got pulled up in to MTQ from its subclasses when the "flex_1458" branch was merged in, and I guess that hasn't happened in 3.x? So it looks to me that the only way to get this change into 3.x is either to do the same kind of refactoring that was done in that branch, or to go back to special-casing all the different MTQs. (ie something like: else if (query instanceof WildcardQuery) { return ((WildcardQuery)query).getField(); } else if (query instanceof PrefixQuery) { return ((PrefixQuery)query).getField(); } etc... sorry - I didn't realize the patch relied on something only in trunk. That branch was merged in a year and a half ago now, but I guess it must include enough major changes that it's difficult to back-port.
          Hide
          Koji Sekiguchi added a comment -

          Committed revision 1166954 in trunk.

          Now I'm trying to port 3x, but I got "no such method" issue in FieldQuery:

          else if (query instanceof MultiTermQuery) {
            return ((MultiTermQuery)query).getField();
          }
          

          Do you have an idea to solve this?

          Show
          Koji Sekiguchi added a comment - Committed revision 1166954 in trunk. Now I'm trying to port 3x, but I got "no such method" issue in FieldQuery: else if (query instanceof MultiTermQuery) { return ((MultiTermQuery)query).getField(); } Do you have an idea to solve this?
          Hide
          Mike Sokolov added a comment -

          here you go, Koji - thanks for the quick review

          Show
          Mike Sokolov added a comment - here you go, Koji - thanks for the quick review
          Hide
          Mike Sokolov added a comment -

          updated patch resolves issue w/possibly rewriting MTQs multiple times

          Show
          Mike Sokolov added a comment - updated patch resolves issue w/possibly rewriting MTQs multiple times
          Hide
          Koji Sekiguchi added a comment -

          Patch looks good!

          Possibly LUCENE-2878 will end up as something better, but it is a longer-term project I think with a lot of work left to be done; in the meantime this would offer a good way to extend FVH support to a broader range of queries.

          +1. I think this is close to commit. Mike, can you sort out FIXME part?

          Show
          Koji Sekiguchi added a comment - Patch looks good! Possibly LUCENE-2878 will end up as something better, but it is a longer-term project I think with a lot of work left to be done; in the meantime this would offer a good way to extend FVH support to a broader range of queries. +1. I think this is close to commit. Mike, can you sort out FIXME part?
          Hide
          Koji Sekiguchi added a comment -

          Thank you for the wake-up call, Mike! I'd like to review the patch tomorrow.

          Show
          Koji Sekiguchi added a comment - Thank you for the wake-up call, Mike! I'd like to review the patch tomorrow.
          Hide
          Mike Sokolov added a comment -

          Koji - back in June, you expressed some interest in this; are you still interested / do you have time to review the patch? I think it is as good a solution as one can wish for in the context of FVH. Possibly LUCENE-2878 will end up as something better, but it is a longer-term project I think with a lot of work left to be done; in the meantime this would offer a good way to extend FVH support to a broader range of queries. This'll be my last nudge

          Show
          Mike Sokolov added a comment - Koji - back in June, you expressed some interest in this; are you still interested / do you have time to review the patch? I think it is as good a solution as one can wish for in the context of FVH. Possibly LUCENE-2878 will end up as something better, but it is a longer-term project I think with a lot of work left to be done; in the meantime this would offer a good way to extend FVH support to a broader range of queries. This'll be my last nudge
          Hide
          Mike Sokolov added a comment -

          This patch adds support for highlighting MultiTermQuery in FastVectorHighlighter via Query.rewrite(). I left one FIXME (should that be nocommit?) that should be fairly easy to resolve: we currently rewrite() the same MTQ query twice in some circumstances - if it's in a phrase I think. I'd be happy to sort that out if y'all decide to commit this.

          Show
          Mike Sokolov added a comment - This patch adds support for highlighting MultiTermQuery in FastVectorHighlighter via Query.rewrite(). I left one FIXME (should that be nocommit?) that should be fairly easy to resolve: we currently rewrite() the same MTQ query twice in some circumstances - if it's in a phrase I think. I'd be happy to sort that out if y'all decide to commit this.
          Hide
          Robert Muir added a comment -

          well I think Simon might be looking for feedback on LUCENE-2878, which would allow you to get at the positions and corresponding payloads.

          So as an experiment close to what you describe, you could play with his patch, make a TokenFilter that copies whatever offset info highlighting needs into the payload (OffsetAsPayloadFilter or something), and try to make a quick-n-dirty highlighter that uses it?

          It would be interesting to see what the performance is like from this versus the term vectors, besides working with all queries

          Show
          Robert Muir added a comment - well I think Simon might be looking for feedback on LUCENE-2878 , which would allow you to get at the positions and corresponding payloads. So as an experiment close to what you describe, you could play with his patch, make a TokenFilter that copies whatever offset info highlighting needs into the payload (OffsetAsPayloadFilter or something), and try to make a quick-n-dirty highlighter that uses it? It would be interesting to see what the performance is like from this versus the term vectors, besides working with all queries
          Hide
          Mike Sokolov added a comment -

          Ah, I see - that's awesome, thanks, had no idea. Yeah - I had been thinking about matching positions->offsets using the existing term vectors, which was going to be kind of unpleasant; you have to iterate by term, which you don't care about, and scan for a matching position.

          Show
          Mike Sokolov added a comment - Ah, I see - that's awesome, thanks, had no idea. Yeah - I had been thinking about matching positions->offsets using the existing term vectors, which was going to be kind of unpleasant; you have to iterate by term, which you don't care about, and scan for a matching position.
          Hide
          Robert Muir added a comment -

          Hi Mike, Simon has an issue open to make a lot of what you are talking about wrt positions easier:
          LUCENE-2878

          In my opinion once LUCENE-2878 is resolved, we may want to then consider adding the capability for a codec to encode the offset deltas in parallel with the positions (so its just a stream of delta-encoded integers you read in parallel with the positions for things like highlighting).

          Then, highlighting would not require term vectors anymore right? I think this would be much faster and more efficient without the space waste of term vectors, and we could prototype such a thing by encoding these ourselves into the payloads... which is close to the same, but I think ultimately optionally supporting offsets this way will be better especially with block-oriented compression algorithms.

          Show
          Robert Muir added a comment - Hi Mike, Simon has an issue open to make a lot of what you are talking about wrt positions easier: LUCENE-2878 In my opinion once LUCENE-2878 is resolved, we may want to then consider adding the capability for a codec to encode the offset deltas in parallel with the positions (so its just a stream of delta-encoded integers you read in parallel with the positions for things like highlighting). Then, highlighting would not require term vectors anymore right? I think this would be much faster and more efficient without the space waste of term vectors, and we could prototype such a thing by encoding these ourselves into the payloads... which is close to the same, but I think ultimately optionally supporting offsets this way will be better especially with block-oriented compression algorithms.
          Hide
          Mike Sokolov added a comment -

          Robert: Thanks that sounds like good advice. I wasn't completely happy with that Pattern list anyway; really still just feeling my way around Lucene and trying random things at this point a bit. I wonder if you could comment on this possible other idea, following up on Mike M's quote above:

          I tried hacking up SpanScorer to see if I could get positions out of it using a custom Collector, but found that by the time a doc was reported, SpanScorer had already iterated over and dropped the positions. I was thinking of adding a Collector.collectSpans(int start, int end), and having SpanScorer call it (it would be an empty function in Collector proper) or something like that. At this point I'm wondering if it might be possible to rewrite many queries as some kind of SpanQuery (using a visitor), without the need to actually alter all the Query implementations. Is there a better way?

          I was also thinking it might be possible to capture and re-use positions gathered during the initial scoring episode rather than having to re-score during highlighting, but I guess that's a separate issue.

          Koji: Thanks for the review, but it sounds like some more iteration is needed here; for sure on RegExpQuery. I probably should have tested that a bit more carefully, although the one thing I tried (character classes) seems to work the same.

          Show
          Mike Sokolov added a comment - Robert: Thanks that sounds like good advice. I wasn't completely happy with that Pattern list anyway; really still just feeling my way around Lucene and trying random things at this point a bit. I wonder if you could comment on this possible other idea, following up on Mike M's quote above: I tried hacking up SpanScorer to see if I could get positions out of it using a custom Collector, but found that by the time a doc was reported, SpanScorer had already iterated over and dropped the positions. I was thinking of adding a Collector.collectSpans(int start, int end), and having SpanScorer call it (it would be an empty function in Collector proper) or something like that. At this point I'm wondering if it might be possible to rewrite many queries as some kind of SpanQuery (using a visitor), without the need to actually alter all the Query implementations. Is there a better way? I was also thinking it might be possible to capture and re-use positions gathered during the initial scoring episode rather than having to re-score during highlighting, but I guess that's a separate issue. Koji: Thanks for the review, but it sounds like some more iteration is needed here; for sure on RegExpQuery. I probably should have tested that a bit more carefully, although the one thing I tried (character classes) seems to work the same.
          Hide
          Robert Muir added a comment -

          A possible issue is that regex support will differ from RegexpQuery, but I think? that Java's is a superset, so should be ok, but I'm not sure about this one.

          Actually, these are totally different syntaxes!

          An alternative way to flatten these multitermqueries could be to implement o.a.l.index.Terms with what is in the term vector... then you could rewrite them with their own code.

          trying to generate an equivalent string pattern could be a little problematic, for example wildcard supports escaped terms (and could contain other characters that are java.util.regex syntax characters but not wildcard syntax characters), the regex syntax is different, etc.

          if you still decide you want to do it this way though, i would use o.a.l.util.automaton instead of java.util.regex? Besides being faster, this is internally what these queries are using anyway, so you can convert them with for example WildcardQuery.toAutomaton(). Then, union these and match against the union'ed machine instead of a List.

          But personally i would look at going the Terms/rewriteMethod route if possible, this way all multitermqueries will "just work".

          Show
          Robert Muir added a comment - A possible issue is that regex support will differ from RegexpQuery, but I think? that Java's is a superset, so should be ok, but I'm not sure about this one. Actually, these are totally different syntaxes! An alternative way to flatten these multitermqueries could be to implement o.a.l.index.Terms with what is in the term vector... then you could rewrite them with their own code. trying to generate an equivalent string pattern could be a little problematic, for example wildcard supports escaped terms (and could contain other characters that are java.util.regex syntax characters but not wildcard syntax characters), the regex syntax is different, etc. if you still decide you want to do it this way though, i would use o.a.l.util.automaton instead of java.util.regex? Besides being faster, this is internally what these queries are using anyway, so you can convert them with for example WildcardQuery.toAutomaton(). Then, union these and match against the union'ed machine instead of a List. But personally i would look at going the Terms/rewriteMethod route if possible, this way all multitermqueries will "just work".
          Hide
          Koji Sekiguchi added a comment -

          Patch looks really good!

          To handle RangeQuery, you'd need to add another such data structure: it would probably be best to introduce some new abstraction to represent all of these query-proxies.

          Would you like to try this one?

          It seemed a less useful case to me anyway since we don't usually use range queries in the context of full text; more often they come up in structured metadata? Curious if you have requests for that?

          I don't have the requirement for highlighting range queries, even wildcard, prefix and regexp either. Because I'm using FVH to highlight terms in N-gram fields, and these MultiTermQueries are not ideal for N-gram. But if FVH could cover range queries, it should be nicer for users.

          Show
          Koji Sekiguchi added a comment - Patch looks really good! To handle RangeQuery, you'd need to add another such data structure: it would probably be best to introduce some new abstraction to represent all of these query-proxies. Would you like to try this one? It seemed a less useful case to me anyway since we don't usually use range queries in the context of full text; more often they come up in structured metadata? Curious if you have requests for that? I don't have the requirement for highlighting range queries, even wildcard, prefix and regexp either. Because I'm using FVH to highlight terms in N-gram fields, and these MultiTermQueries are not ideal for N-gram. But if FVH could cover range queries, it should be nicer for users.
          Hide
          Mike Sokolov added a comment -

          Patch includes FVH support for Wildcard-, Regexp- and PrefixQuery. Change to Enwiki benchmark (to generate wildcard queries) should maybe not be committed; just providing this as a validation of this approach.

          Show
          Mike Sokolov added a comment - Patch includes FVH support for Wildcard-, Regexp- and PrefixQuery. Change to Enwiki benchmark (to generate wildcard queries) should maybe not be committed; just providing this as a validation of this approach.
          Hide
          Mike Sokolov added a comment -

          No, no range queries, sorry. I don't think that's easily expressible as a regex? So it would add probably require yet another data structure in FieldQuery - right now we have Map<String,QueryPhraseMap> for TermQuery; I've added a List<QueryPhraseMap> and QPM.Pattern for matching wildcards and regexes. To handle RangeQuery, you'd need to add another such data structure: it would probably be best to introduce some new abstraction to represent all of these query-proxies.

          It seemed a less useful case to me anyway since we don't usually use range queries in the context of full text; more often they come up in structured metadata? Curious if you have requests for that?

          Anyway I will clean up a bit and post.

          Show
          Mike Sokolov added a comment - No, no range queries, sorry. I don't think that's easily expressible as a regex? So it would add probably require yet another data structure in FieldQuery - right now we have Map<String,QueryPhraseMap> for TermQuery; I've added a List<QueryPhraseMap> and QPM.Pattern for matching wildcards and regexes. To handle RangeQuery, you'd need to add another such data structure: it would probably be best to introduce some new abstraction to represent all of these query-proxies. It seemed a less useful case to me anyway since we don't usually use range queries in the context of full text; more often they come up in structured metadata? Curious if you have requests for that? Anyway I will clean up a bit and post.
          Hide
          Koji Sekiguchi added a comment -

          Sounds great!

          This doesn't take you to nirvana, but does add support for a common case. If there's interest I'll post.

          Sure, please post. Does it cover range queries?

          Show
          Koji Sekiguchi added a comment - Sounds great! This doesn't take you to nirvana, but does add support for a common case. If there's interest I'll post. Sure, please post. Does it cover range queries?
          Hide
          Mike Sokolov added a comment -

          I made an incremental change to FVH to support WildcardQuery, PrefixQuery and RegexpQuery. Just uses Java regexes to match. It is faster than HighlightQuery (4-5x in enwiki benchmark) although not as much faster as the comparison w/TermQuery.

          A possible issue is that regex support will differ from RegexpQuery, but I think? that Java's is a superset, so should be ok, but I'm not sure about this one.

          This doesn't take you to nirvana, but does add support for a common case. If there's interest I'll post.

          Show
          Mike Sokolov added a comment - I made an incremental change to FVH to support WildcardQuery, PrefixQuery and RegexpQuery. Just uses Java regexes to match. It is faster than HighlightQuery (4-5x in enwiki benchmark) although not as much faster as the comparison w/TermQuery. A possible issue is that regex support will differ from RegexpQuery, but I think? that Java's is a superset, so should be ok, but I'm not sure about this one. This doesn't take you to nirvana, but does add support for a common case. If there's interest I'll post.
          Hide
          Digy added a comment -

          One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc)

          I am using queryParser.setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE)
          before query.rewrite, and it works well.

          DIGY

          Show
          Digy added a comment - One thing i noticed immediately is that many query types are not highlighted (multitermquery, multiphrasequery, etc) I am using queryParser.setMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE) before query.rewrite, and it works well. DIGY
          Hide
          Michael McCandless added a comment -

          I think we "just" need to merge Span*Query into their "nomal" counterparts, making sure there's no performance penalty when you don't use the spans. Then we get the exact occurrence of every match "for free"

          Show
          Michael McCandless added a comment - I think we "just" need to merge Span*Query into their "nomal" counterparts, making sure there's no performance penalty when you don't use the spans. Then we get the exact occurrence of every match "for free"
          Hide
          Robert Muir added a comment -

          Jason, no but the high-level idea in concept is similar: re-run the query on single doc "mini-index" to work a bit differently (specialized for highlighting)

          if i had done it in a nice way I would have contributed something

          Show
          Robert Muir added a comment - Jason, no but the high-level idea in concept is similar: re-run the query on single doc "mini-index" to work a bit differently (specialized for highlighting) if i had done it in a nice way I would have contributed something
          Hide
          Jason Rutherglen added a comment -

          Robert, you've implemented extending scorer to return the exact term occurrences?

          Show
          Jason Rutherglen added a comment - Robert, you've implemented extending scorer to return the exact term occurrences?

            People

            • Assignee:
              Koji Sekiguchi
              Reporter:
              Robert Muir
            • Votes:
              4 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development