Lucene - Core
  1. Lucene - Core
  2. LUCENE-1425

Add ConstantScore highlighting support to SpanScorer

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Its actually easy enough to support the family of constantscore queries with the new SpanScorer. This will also remove the requirement that you rewrite queries against the main index before highlighting (in fact, if you do, the constantscore queries will not highlight).

      1. LUCENE-1425.patch
        14 kB
        Mark Miller
      2. LUCENE-1425.patch
        14 kB
        Mark Miller
      3. LUCENE-1425.patch
        4 kB
        Mark Miller

        Issue Links

          Activity

          Hide
          Mark Miller added a comment -

          First go at it, should be feature complete, but not thought complete.

          Show
          Mark Miller added a comment - First go at it, should be feature complete, but not thought complete.
          Hide
          Yonik Seeley added a comment -

          It seems like we should separate the concepts of extracting terms from highlighting.
          There are scalability issues involved in enumerating all terms (both CPU and especially memory), and it seems like we should avoid that if possible. What about some kind of interface that would allow leaving these queries in symbolic form?

          class Query {
            public boolean matches(Token t);
          }
          
          Or create an object optimized for matching for better speed:
          
          class Query {
            public TokenMatcher getTokenMatcher(IndexReader r);
          }
          class TokenMatcher {
            public boolean matches(Token t);  // returns true if the token is part of this query
            public float score(Token t);  // returns <0 if no match
          }
          
          Or perhaps even better is something that takes a complete TokenStream and spits back matches:
          
          class TokenMatcher {
            public void setTokenStream(TokenStream ts);
          
            public boolean next();  // returns true if there is another match
            public int getStartOffset();
            public int getEndOffset();
            public float getScore();
          }
          

          Basically, you want highlighting for a clause like a* to boil down to tokenText.startsWith("a")
          Does that make sense?

          Show
          Yonik Seeley added a comment - It seems like we should separate the concepts of extracting terms from highlighting. There are scalability issues involved in enumerating all terms (both CPU and especially memory), and it seems like we should avoid that if possible. What about some kind of interface that would allow leaving these queries in symbolic form? class Query { public boolean matches(Token t); } Or create an object optimized for matching for better speed: class Query { public TokenMatcher getTokenMatcher(IndexReader r); } class TokenMatcher { public boolean matches(Token t); // returns true if the token is part of this query public float score(Token t); // returns <0 if no match } Or perhaps even better is something that takes a complete TokenStream and spits back matches: class TokenMatcher { public void setTokenStream(TokenStream ts); public boolean next(); // returns true if there is another match public int getStartOffset(); public int getEndOffset(); public float getScore(); } Basically, you want highlighting for a clause like a* to boil down to tokenText.startsWith("a") Does that make sense?
          Hide
          Mark Miller added a comment -

          Its a balancing act I suppose. Mark Harwood has suggested something similar in the past. Its ideally where we would like to end up I would think, but at the same time, it requires a lot more effort to implement the match logic in each query. I'll be happy to give it a shot as I can, but it would be nice to have some sort of support on a shorter time frame, even if inefficient. I think there are a lot of setups out there where you wont have enough matches in a document to consume that much memory. To play nice, we ship with constantscore scoring off with the option to enable. Technically, its not really going to be any less efficient than the current support for non constantscore wildcard,prefix,range (other than the instantiated index).

          We can certainly hold off for a while to see where a match solution might go, but if it doesn't get much momentum, I'd like to offer this option. Its a similar rationality to the Span Highlighter itself. It could have been done more efficiently if it broke the old API, and it could have been done right if we built in support to core Lucene, but the way it was done - its there and available and other ways are not.

          I'll let this stew and play around with the token matcher idea for a bit if/when/as time arises. SpanQueries, FilterQueries, ConstantScoreQueries, oh my.

          Show
          Mark Miller added a comment - Its a balancing act I suppose. Mark Harwood has suggested something similar in the past. Its ideally where we would like to end up I would think, but at the same time, it requires a lot more effort to implement the match logic in each query. I'll be happy to give it a shot as I can, but it would be nice to have some sort of support on a shorter time frame, even if inefficient. I think there are a lot of setups out there where you wont have enough matches in a document to consume that much memory. To play nice, we ship with constantscore scoring off with the option to enable. Technically, its not really going to be any less efficient than the current support for non constantscore wildcard,prefix,range (other than the instantiated index). We can certainly hold off for a while to see where a match solution might go, but if it doesn't get much momentum, I'd like to offer this option. Its a similar rationality to the Span Highlighter itself. It could have been done more efficiently if it broke the old API, and it could have been done right if we built in support to core Lucene, but the way it was done - its there and available and other ways are not. I'll let this stew and play around with the token matcher idea for a bit if/when/as time arises. SpanQueries, FilterQueries, ConstantScoreQueries, oh my.
          Hide
          Mark Harwood added a comment -

          Delegating responsibility away from WeightedSpanTermExtractor seems desirable to avoid having special-case "if clause instanceof xxx"- type logic everywhere.
          I suppose "extractTerms" was the core Query API's previous nod to supporting highlighting but is obviously falling short of what we need because:
          1) It relies on generating potentially large lists of terms
          2) It has no corresponding API that can be called on nested Filters
          3) It does not reproduce positional logic e.g. Phrase/span queries

          The above TokenMatcher API certainly seems like a clean API from the highlighter's perspective but implementing it in queries/filters could prove trickier - especially for "container" classes such as BooleanQuery, SpanQuery, FilteredQuery, ConstantScoreQueries wrapping CachingWrapperFilters etc.

          There's a lot to consider here and Mark M's suggestion of adding more "special-case" logic into WeightedSpanTermExtractor may be a sensible holding position for now (guaranteed, as soon as Lucene-1424 is committed people will start squealing for some related highlighting support!) Another temporary alternative to consider is for Lucene-1424's "ConstantScoreXxxxQuerys" to implement extractTerms() in which case they retain the list of terms they generate - solely for highlighting purposes. This would require no specialised logic to be added to WeightedSpanTermExtractor and makes use of the existing-but-suboptimal "highlighter-support" API (Query.extractTerms)

          Show
          Mark Harwood added a comment - Delegating responsibility away from WeightedSpanTermExtractor seems desirable to avoid having special-case "if clause instanceof xxx"- type logic everywhere. I suppose "extractTerms" was the core Query API's previous nod to supporting highlighting but is obviously falling short of what we need because: 1) It relies on generating potentially large lists of terms 2) It has no corresponding API that can be called on nested Filters 3) It does not reproduce positional logic e.g. Phrase/span queries The above TokenMatcher API certainly seems like a clean API from the highlighter's perspective but implementing it in queries/filters could prove trickier - especially for "container" classes such as BooleanQuery, SpanQuery, FilteredQuery, ConstantScoreQueries wrapping CachingWrapperFilters etc. There's a lot to consider here and Mark M's suggestion of adding more "special-case" logic into WeightedSpanTermExtractor may be a sensible holding position for now (guaranteed, as soon as Lucene-1424 is committed people will start squealing for some related highlighting support!) Another temporary alternative to consider is for Lucene-1424's "ConstantScoreXxxxQuerys" to implement extractTerms() in which case they retain the list of terms they generate - solely for highlighting purposes. This would require no specialised logic to be added to WeightedSpanTermExtractor and makes use of the existing-but-suboptimal "highlighter-support" API (Query.extractTerms)
          Hide
          Mark Miller added a comment -

          I'd like to go with this method as a stop gap solution for now. Its no worse than the previous approach of rewriting multi-term queries against the reader - in fact, its often much better as this approach rewrites against a memory index that only contains the doc to be highlighted.

          I think its important that we offer some way to highlight all of the query types. Performance and improvements can come later.

          This is an extension to the SpanScorer and won't affect the core Highlighter. So people can choose to use the SpanScorer if they want positional support (the non positional support is no slower), and then they can further choose if they want to expand multi term queries in the way provided (expanding multi-term defaults to false). Its generally going to work quite well, and the alternative at the moment is no highlighting.

          A general better solution that does not work with positions would probably be better applied to the standard Highlighter, and then the SpanScorer can take advantage of it at that time. Until then, I'd like to offer full query type highlighting support with the SpanScorer since it is possible.

          So this solution adds the ability to turn on highlighting of all the Querys that extend MutliTermQuery, constantscore mode true or false. That gives the SpanScorer almost full query type highlighting coverage I think.

          Show
          Mark Miller added a comment - I'd like to go with this method as a stop gap solution for now. Its no worse than the previous approach of rewriting multi-term queries against the reader - in fact, its often much better as this approach rewrites against a memory index that only contains the doc to be highlighted. I think its important that we offer some way to highlight all of the query types. Performance and improvements can come later. This is an extension to the SpanScorer and won't affect the core Highlighter. So people can choose to use the SpanScorer if they want positional support (the non positional support is no slower), and then they can further choose if they want to expand multi term queries in the way provided (expanding multi-term defaults to false). Its generally going to work quite well, and the alternative at the moment is no highlighting. A general better solution that does not work with positions would probably be better applied to the standard Highlighter, and then the SpanScorer can take advantage of it at that time. Until then, I'd like to offer full query type highlighting support with the SpanScorer since it is possible. So this solution adds the ability to turn on highlighting of all the Querys that extend MutliTermQuery, constantscore mode true or false. That gives the SpanScorer almost full query type highlighting coverage I think.
          Hide
          Mark Miller added a comment -

          I'd like to commit this soon.

          Show
          Mark Miller added a comment - I'd like to commit this soon.
          Hide
          Mark Miller added a comment -

          needs a little fix to work with null field option and possibly default field

          Show
          Mark Miller added a comment - needs a little fix to work with null field option and possibly default field

            People

            • Assignee:
              Mark Miller
              Reporter:
              Mark Miller
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development