Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: Positions Branch
    • Fix Version/s: Positions Branch
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spin-off from LUCENE-2878. Since we already have positions on a large number of queries in the branch, it is worth looking at highlighting as a real consumer of the API. A prototype is already committed.

      1. LUCENE-3318.patch
        68 kB
        Mike Sokolov
      2. LUCENE-3318.patch
        71 kB
        Mike Sokolov
      3. LUCENE-3318.patch
        82 kB
        Mike Sokolov

        Activity

        Robert Muir added a comment -

        glancing at the branch, i would suggest dropping use of term vectors completely here, and building the prototype to store the start/end offsets into the payload for now. then the highlighter can translate positions into offsets via the payload.

        i think we should not use term vectors for highlighting... if the prototype gets going we can then look at encoding (optionally) this offset information in the postings explicitly in parallel with positions more efficiently, e.g. in ways that exploit the fact that in a majority of cases endOffset - startOffset is going to be the same for every position of term X.
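Robert's "store the start/end offsets into the payload for now" suggestion could be prototyped with a fixed-width codec along these lines. This is a plain-Java sketch: the class name and the 8-byte layout are illustrative assumptions, not the branch's actual encoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical fixed-width payload codec: pack start and end offsets
// into an 8-byte payload so the highlighter can translate a position
// back into character offsets without consulting term vectors.
public class OffsetPayload {

    // Encode (startOffset, endOffset) as two big-endian 32-bit ints.
    public static byte[] encode(int startOffset, int endOffset) {
        return ByteBuffer.allocate(8)
                .putInt(startOffset)
                .putInt(endOffset)
                .array();
    }

    // Decode a payload back into {startOffset, endOffset}.
    public static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }
}
```

A fixed 8 bytes per position is wasteful precisely because, as noted above, the length rarely varies per term; a variable-length encoding (discussed later in the thread) would usually cost far fewer bytes.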

        Mike Sokolov added a comment -

        That's been on my mind since you mentioned it earlier, Robert. I'll take a whack at it next week if someone doesn't beat me to it. I do think it will be good to maintain the ability to highlight using whatever information is available, though.

        Robert Muir added a comment -

        I do think it will be good to maintain the ability to highlight using whatever information is available, though.

        Well, the traditional school of thought has been to do this (e.g. highlighters/morelikethis today that can work with tvs or without them). Personally I disagree with this reasoning. I think the whole point of indexing your content is to make searching fast: and we should make highlighting first class and make it kick ass.

        I guess a compromise would be to fall back to existing TVs, but not to re-analyzing the document at runtime: I guess I think that providing "slow options" is not actually user-friendly but instead just causes confusion and performance problems... better to kick out an error and say 'you must index your content with XYZ for this to work at all'.

        Yonik Seeley added a comment -

        The most useful fallback is being able to highlight via re-analyzing the text. It's perfectly scalable and the right tradeoff for some people (say indexing a billion tweets). Other people may not have as strict of time requirements, but may have more strict space requirements.

        If termvector based highlighting isn't much faster than re-analysis, that's the one we should drop support for.

        Robert Muir added a comment -

        the last thing we need is another highlighter that falls back to this: this already exists.

        if we want to break index back compat, default offsets to ON, and require reindexing to upgrade to 4.0, then, I'm ok with the fallback.

        otherwise I am -1, as it's just going to be a bad slow default.

        Mark Miller added a comment -

        Re-analyzing is often really not that slow for many, many applications. I have no problem with it as a fallback anytime.

        Mark Miller added a comment -

        Thinking further - I suppose you could argue that you have to explicitly enable an option to make this happen, rather than getting the behavior through fallback. But outright blocking it just seems silly.

        Hoss Man added a comment -

        I think rmuir's point is that at a Java class level, we already have a highlighter that can do this.

        If we add a new XyzBasedHighlighter class that can take advantage of some new XYZ information that can be recorded at index time, it's better for that highlighter to fail fast if the Xyz info isn't actually in the index – if you want term vectors or re-analysis as a fall back, then make that decision in your code when you call XyzBasedHighlighter.getHighlights().

        i don't really know anything about the existing Highlighter impl, but i agree with that philosophy at the Java API level

        (At a Solr API level there are a lot more options. the naive API is to make the user pick an impl and fail if it doesn't work, but alternately we could pick the impl based on what options are used on the fields being indexed, etc...)

        Robert Muir added a comment -

        thanks hossman. that's exactly it.

        there's no need for such fallbacks to be in lucene's library code: they are confusing and also can sometimes inadvertently create bad defaults. A user can implement this logic themselves.

        Mark Miller added a comment -

        we already have a highlighter that can do this.

        I'm hoping that once we have a highlighter that can use this information, that can be a much better highlighter - the others can go away. If we can avoid it, I don't see why we would want to split effort over 3 different highlighters.

        then make that decision in your code when you call

        As I mentioned - I think it's fine to make it so that it has to be explicitly asked for in code. But I think it's silly to disallow that just because in some cases with huge documents and a giant maxCharsToAnalyze, it can be slow.

        and also can sometimes inadvertently create bad defaults.

        Defaults can always be bad in some cases - else they wouldn't be called defaults - they would just be the way it is. In this case, IMO, it's generally been a fine default for the old highlighter - it's the minority of use cases that have needed something else - all you can expect from a good default.

        Yonik Seeley added a comment -

        OK, thanks for the clarifications. As long as we have the ability somehow/somewhere to do highlighting based on re-analysis, I'm good.

        Robert Muir added a comment -

        Defaults can always be bad in some cases - else they wouldn't be called defaults - they would just be the way it is. In this case, IMO, it's generally been a fine default for the old highlighter - it's the minority of use cases that have need something else - all you can expect from a good default.

        There doesn't need to be defaults at all: i think these different ways should be split into independent highlighting methods. we can re-use the same APIs (fragmenters or whatever) across the different implementations, if possible, but why not just force the user to use the algorithm they want? i don't think we need to have ABC+DEFHighlighter that does black magic across different algorithms, you should instead pick ABCHighlighter or DEFHighlighter.

        this is a library after all, I think this would be much cleaner.

        Mark Miller added a comment -

        but why not just force the user to use the algorithm they want?

        That I don't mind - just that the option exists if it's possible. Whether you choose through a setter or a different sub class, I don't mind. If it's not possible to re-analyze without keeping the other Highlighters around too, I'm much less for keeping the option around (at the least, the old highlighters should be heavily de-emphasized compared to the new one). I have not looked at the new code yet though.

        Mike Sokolov added a comment -

        Uploading a patch for this that builds a TokenStream using position intervals from the query (matches) and their offsets, and then uses the existing Highlighter to do fragmentation and markup.

        This approach should make it easy to skip creating fragments for interstitial (non-matching) portions of large documents, but this issue doesn't cover that yet.

        The patch provides two methods for mapping positions to offsets; one is based on term vectors; the other uses offsets stored in payloads. And you can still use analysis. The payload version is about twice as fast as the term vector version, which is around 8x faster than reanalysis (comparable to FastVectorHighlighter). The choice of which to use (or whether to re-analyze) is up to the user; there are no auto-fallback behaviors in here.

        Using these schemes makes fragmentation more difficult. The issue is that offsets and positions are not readily available for all tokens - only for those that matched the query. This makes it harder to fragment the document in reasonable places, and to surround the hits with some appropriate text. However, the substantial speedup seems to make it worth the effort.

        Some TODO's:

        There's currently no consistency-checking: if no offset-payloads were stored, and the user attempts to use them, they simply get no highlighting. I think there may be a hard fail if absent term vectors are requested though.

        Fragmentation doesn't necessarily land on a good boundary; we should at least scan for whitespace in a default fragmenter.

        Simon: something a bit weird happens when collecting position intervals now; in some cases the same interval can be collected twice. This happens w/ConjunctionPositionIterator - when PosCollector.collect(doc) calls advanceTo(doc), positions are collected, and then I iterate over more positions, and collect() them (which I have to do to get other cases to work); then during this latter iteration, the same intervals are reported again. I've worked around this easily enough, but I think it would be easier to work with if it didn't happen? Not sure how difficult that is to arrange.

        Also: I made all the collect() methods throw IOException so I could report exceptions from processing payloads.

        Uwe Schindler added a comment - - edited

        Some note from the TokenStream policeman: The TokenStream must call clearAttributes() always as first action unless it returns false (see docs); the positionIncrement is then also automatically 1. Also you should make the whole class PosTokenStream final, as you can/should never extend it.

        Uwe Schindler added a comment -

        Hi,

        if (q instanceof MultiTermQuery) {
          ((MultiTermQuery)q).setRewriteMethod (MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE);
        }
        

        This changes the original query, so we should clone the query before (this is what the standard highlighter does when it rewrites queries) - but: maybe cloning was done before in other highlighter code. This one just looks wrong to me. Also for large indexes with many terms, this may easily throw TooManyClausesException, so we should catch this exception somehow and disable highlighting in that case.

        About the payloads encoding: I would not limit the offsets and positions and use the full 32bit ints, but instead encode them as vInt. The faceting package already has a method for this (it does similar things, namely encoding ints into payloads), maybe we move those byte[] vInt encoders to core utils.
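Uwe's vInt idea can be sketched in plain Java as below. The codec is illustrative only (in practice Lucene's own vInt helpers would be reused, as suggested): it stores startOffset plus the usually-tiny length, so most payloads shrink to two or three bytes instead of a fixed eight.

```java
import java.io.ByteArrayOutputStream;

// Illustrative vInt codec for offset payloads: 7 data bits per byte,
// high bit set on every byte except the last, low-order bits first.
public class VIntOffsets {

    // Append v to out as a vInt.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Read a vInt starting at pos[0]; advances pos[0] past it.
    static int readVInt(byte[] b, int[] pos) {
        int v = 0, shift = 0;
        byte cur;
        do {
            cur = b[pos[0]++];
            v |= (cur & 0x7F) << shift;
            shift += 7;
        } while ((cur & 0x80) != 0);
        return v;
    }

    // Encode startOffset, then the length (endOffset - startOffset),
    // which is typically a small number for real tokens.
    public static byte[] encode(int startOffset, int endOffset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, startOffset);
        writeVInt(out, endOffset - startOffset);
        return out.toByteArray();
    }

    // Decode back into {startOffset, endOffset}.
    public static int[] decode(byte[] payload) {
        int[] pos = {0};
        int start = readVInt(payload, pos);
        int len = readVInt(payload, pos);
        return new int[] { start, start + len };
    }
}
```

For a token near the start of a small document, e.g. offsets (3, 8), the whole payload is two bytes.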

        Robert Muir added a comment -

        About the payloads encoding: I would not limit the offsets and positions and use the full 32bit ints, but instead encode them as vInt. The facetting package already has a method for this (it does similar thins, namely encoding ints into payloads), maybe we move those byte[] vInt encoders to core utils.

        You can re-use a bytearraydataoutput here. see the analysis.synonyms builder package for an example.

        Mike Sokolov added a comment -

        Thanks for the pointers; I'll follow up soon.

        Re: rewriting MultiTermQueries: the code that is in the patch is definitely just a starting point. It has the problems you pointed out Uwe, and also won't work properly in the case where a MTQ is not the top-level query, e.g. (foo* AND bar*).

        This issue is dealt with in the current system by some complex handcrafted logic in WeightedSpanTermExtractor (via QueryScorer), which we are not using here. I hope we can do something simpler, but we do need to walk the query tree cloning MTQs at least - this seems to be what WeightedSpanTermExtractor does.

        Also - I did run a quick test to check, and it seems that the existing HL may throw a TooManyClauses in the same situation we would here. Perhaps HL consumers expect this? I think I agree that it would be more friendly to catch internally, though - can the consumer recover in any other way? I don't think so.

        Robert Muir added a comment -

        Also - I did run a quick test to check, and it seems that the existing HL may throw a TooManyClauses in the same situation we would here. Perhaps HL consumers expect this? I think I agree that it would be more friendly to catch internally, though - can the consumer recover in any other way? I don't think so.

        Maybe the rewritemethod used here for highlighting should be instead TopTermsRewrite.

        Uwe Schindler added a comment -

        Also - I did run a quick test to check, and it seems that the existing HL may throw a TooManyClauses in the same situation we would here. Perhaps HL consumers expect this? I think I agree that it would be more friendly to catch internally, though - can the consumer recover in any other way? I don't think so.

        With standard highlighter the query is executed against a MemoryIndex with only this one document. It's unlikely to fail, but it can for very large docs (of course).

        Mike Sokolov added a comment -

        Maybe the rewritemethod used here for highlighting should be instead TopTermsRewrite.

        That sounds right - we make a best effort. The primary use case for HL, IMO, is just to show a few representative samples that explain why the document matched.

        Mike Sokolov added a comment -

        The TokenStream must call clearAttributes() always as first action unless it returns false (see docs)

        This makes sense, but I don't see anything in the javadocs of TokenStream or AttributeSource about it.

        Mike Sokolov added a comment -

        Updated patch encodes offset payloads as VInts, corrects TokenStream usage, and adds an (ignored) test for complex query rewriting.

        Nothing done about query rewriting yet - my question is: is there a better way to walk a query tree than a big switch statement with instanceof? It would be nice if Query provided List<Query> subs(), but I don't see it.
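For illustration, the instanceof-based walk being discussed could look like the sketch below. The Query/BooleanQuery/MultiTermQuery classes here are minimal stand-ins for this example, not Lucene's real hierarchy; the point is only the recursive descent that collects nested MultiTermQuerys.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in query hierarchy: a query may contain sub-queries.
abstract class Query {}

class TermQuery extends Query {}

class MultiTermQuery extends Query {}

class BooleanQuery extends Query {
    final List<Query> clauses = new ArrayList<>();
    BooleanQuery add(Query q) { clauses.add(q); return this; }
}

public class QueryWalker {

    // Recursively collect every MultiTermQuery nested anywhere in the tree.
    public static List<MultiTermQuery> findMultiTermQueries(Query root) {
        List<MultiTermQuery> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(Query q, List<MultiTermQuery> found) {
        if (q instanceof MultiTermQuery) {
            found.add((MultiTermQuery) q);
        } else if (q instanceof BooleanQuery) {
            for (Query sub : ((BooleanQuery) q).clauses) {
                walk(sub, found);
            }
        }
        // other query types have no sub-queries in this sketch
    }
}
```

A `List<Query> subs()` method on Query itself, as wished for above, would collapse the instanceof dispatch into a single uniform recursion.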

        Mike Sokolov added a comment -

        Updated patch now handles MultiTermQuerys nested (however deeply) in BooleanQuerys, and does not modify the queries in order to rewrite them.

        This approach can now be used for highlighting many types of queries (SpanQueries are a special case - they do their own MTQ rewriting), but still does require inspection of query types. It might be nice in the future to provide a Query rewriting interface that allows the caller more control over the kind of rewriting to be done. With this, it would be possible to expect future query types to implement a rewriting method that could support highlighting without the need for instanceof, etc.

        I added a simple white-space boundary detection that can be used to adjust snippet boundaries to fall between words. The basic idea could be extended to do sentence boundary detection.

        I added a Highlighter parameter (maxFragsToScore) that limits the number of fragments considered when attempting to find the highest-scoring one(s). This gives some decent speedup if you just want to find the first fragment in a document.
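The whitespace boundary adjustment described above might look roughly like this; the method name and signature are invented for illustration, not taken from the patch.

```java
// Sketch: move a proposed fragment boundary left until it sits just
// after a whitespace character (or at the start of the text), so a
// snippet never begins mid-word.
public class FragmentBoundary {

    public static int adjustToWhitespace(String text, int proposed) {
        int i = Math.min(proposed, text.length());
        while (i > 0 && !Character.isWhitespace(text.charAt(i - 1))) {
            i--;
        }
        return i;
    }
}
```

The same backward scan generalizes to sentence boundary detection by looking for sentence-ending punctuation instead of whitespace.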


          People

          • Assignee:
            Mike Sokolov
            Reporter:
            Simon Willnauer
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:

              Development