Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.9
    • Fix Version/s: None
    • Component/s: core/search
    • Labels: None
    • Lucene Fields: New, Patch Available

Description

      Related to LUCENE-1929 and the current inability to highlight NumericRangeQuery, spatial queries, cached term filters, and other exotica.

      This patch provides the ability to wrap arbitrary Query objects and record match info as flags encoded in the overall document score.
      Using this approach it would be possible to understand (and therefore highlight) which fields matched clauses in a query.

      The match encoding approach loses some precision in scores as noted here: http://tinyurl.com/ykt8nx7

      Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
      This may be something we should consider.
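
      For illustration only, here is a minimal sketch (hypothetical, not code from the attached patch) of how match flags might be stamped into the low-order mantissa bits of a float score; overwriting those bits is exactly where the precision loss noted above comes from:

      {code:java}
      // Hypothetical sketch, not the attached patch: stamp a small flag byte
      // into the low-order mantissa bits of the score, sacrificing those bits
      // of precision in exchange for per-clause match information.
      public final class MatchFlagEncoder {

        private static final int FLAG_BITS = 8;            // one bit per watched clause
        private static final int FLAG_MASK = (1 << FLAG_BITS) - 1;

        /** Returns the score with its lowest mantissa bits replaced by flags. */
        public static float encode(float score, int flags) {
          int bits = Float.floatToIntBits(score);
          return Float.intBitsToFloat((bits & ~FLAG_MASK) | (flags & FLAG_MASK));
        }

        /** Recovers the flags previously stamped into a score. */
        public static int decode(float score) {
          return Float.floatToIntBits(score) & FLAG_MASK;
        }
      }
      {code}

      Decoding after collection tells you which watched clauses matched each hit, at the cost of up to 8 bits of score precision.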

Activity

        Michael McCandless added a comment -

        Very clever!

        Since you are wrapping arbitrary query objs, couldn't the wrapper make a separate data structure for tracking which clause matched (instead of encoding it into the score)?

        Also: doesn't the highlighter run, separately, on each doc? And so it's OK if the scores are affected? Ie, I would run my main search with a normal query, get the 10 results for the current page, then step through each of those 10 doc IDs, make a single-doc IndexSearcher, and run this wrapper?
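
        (A hypothetical sketch of that second pass, using a one-doc Filter in place of a literal single-doc IndexSearcher; SingleDocFilter is an invented name, not an existing class:)

        {code:java}
        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.DocIdSet;
        import org.apache.lucene.search.Filter;
        import org.apache.lucene.util.OpenBitSet;

        // Hypothetical: a Filter admitting exactly one document, so the wrapper
        // query can be re-run against each of the 10 page hits in turn.
        // (Caveat: with 2.9's per-segment search, getDocIdSet() receives segment
        // readers, so a real implementation would need to translate the
        // top-level docID to a segment-relative one.)
        public class SingleDocFilter extends Filter {
          private final int docId;

          public SingleDocFilter(int docId) {
            this.docId = docId;
          }

          @Override
          public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            OpenBitSet bits = new OpenBitSet(reader.maxDoc());
            bits.set(docId);
            return bits; // OpenBitSet is itself a DocIdSet
          }
        }
        {code}

        Usage would be something like searcher.search(wrapperQuery, new SingleDocFilter(hit.doc), 1), then decoding the flags from the returned hit's score.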

        Avoiding these precision issues would require a change to Lucene core to record docId, score AND a matchFlag byte in ScoreDoc objects and collector APIs.
        This may be something we should consider.

        +1 I would love to see the Scorer API extended to optionally provide details on matches. Not just which clause matched which docs/fields, but the positions within the field where the match occurred. I think we could do this by absorbing *SpanQuery into their normal Query counterparts, making the getSpans API [somehow] optional so that if you didn't invoke it you don't pay a performance price.

        Mark Harwood added a comment -

        couldn't the wrapper make a separate data structure for tracking which clause matched

        I was trying to keep the processing cost super-low with no object allocations because this is in a very tight loop. We don't really want to be generating a lot of state/processing while we're still evaluating potentially millions of candidate matches.
        That seems to be the challenge of doing this instrumentation in-line with query execution.

        Also: doesn't highlighter run, separately, on each doc? And so it's OK if the scores are affected?

        The use case I'm tackling right now involves search forms with lots of optional fields (spatial, numeric, "choice", etc.) and I only needed a yes/no match flag for each field. This approach should give me these answers back immediately without impacting query processing speeds significantly.
        However, I can see the value in core Lucene capturing a richer data structure than a simple boolean where you choose to do a separate "highlight" pass on the top N documents. This would suggest that you might need 2 query expressions - one for execution and one for adding highlighter instrumentation. I suppose the client could add the instrumentation requests to the initial query, where they would be passive during a Lucene "results-selection" mode and become active in "highlight mode".

        Michael McCandless added a comment -

        I see, it sounds like your use case is different from the typical
        highlighting use case in that 1) you don't need the positions of the
        matches (just whether a given clause matched the doc or not), and 2)
        you need it for every single doc visited by the query, not just for
        the handful of docs that are being presented to the user on the
        current "page".

        This would suggest that you might need 2 query expressions - one for execution and one for adding highlighter instrumentation.

        I'm thinking it's the same query, but we fix the Scorer API for all
        queries (= big change!!) to be able to produce match details on
        demand, where those match details look something like what getSpans
        now returns. But for the normal case (only highlighting the docs
        being shown on current page), we'd only get the match details for that
        small set of docs.

        Then we ideally would not need a separate mirrored set of span
        queries. Ie, SpanTermQuery would be absorbed into TermQuery, etc.

        But I could easily be being too naive here. Maybe there is some
        serious performance cost to even adding the optional API in.

        Mark Harwood added a comment -

        and 2) you need it for every single doc visited by the query

        Actually I don't need it for every doc, only the top ones - it just happens to be so cheap to produce that I can afford to run this in-line with the query. (I haven't actually benchmarked it at scale but my gut feeling is that it would be fast.)

        I was thinking that this might be orthogonal to the existing "free-text" based highlighter. The logic for this is roughly that:

        1) Highlighting of free-text fields is reasonably well-catered for with summarisation etc.
        2) The remaining problem areas for highlighting (NumericRangeQuery, spatial, cached term filters on enums, e.g. gender:male/female) are all likely to be non-free-text fields which don't require summarisation and only contain a single value.

        I may be wrong in these assumptions about the existing state of play (any thoughts, Mark M?) but it might be useful to think of attacking the problem with these 2 different requirements in mind.

        Regardless of type (e.g. int, long) I tend to think of fields as falling into these broad usage categories:

        a) "Identifiers" (e.g. primary keys)
        b) Quantifiers (e.g. numerics, dates, spatial)
        c) Free-text
        d) Controlled vocabularies (e.g. enums such as gender:m/f)

        Type a) is catered for with a straight TermQuery and therefore can be handled by the existing highlighter
        Type b) needs special indexes/queries (spatial/trie) and isn't catered for by the existing term/span-based highlighter
        Type c) is catered for by the existing highlighter and its summarising features
        Type d) involves many TermDocs.next() reads so is usefully cached as filters, and is therefore not catered for by the existing highlighter

        So this patch helps cater for types b) and d), where simply knowing that the field matched is all that is required to highlight.

        Lance Norskog added a comment -

        One task in titrating(1) these requests is weighing the specialized vs. the general case. The general case here is redoing the explain API to use a real data structure. The special case is a custom change to the inner scoring loop for certain use cases.

        Do you wish to highlight all 5 million results from a query, or only 10 or 20? With a better explain API, it would be reasonable to require these unusual use cases to do a second, fast search limited to the documents they actually plan to highlight.

        (1) There's a word for deciding who gets medical care and who doesn't. And, no, it's not death panel.
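
        (A rough sketch of that second pass using the existing explain() API, which already yields per-clause match details for a single doc, albeit as a text-oriented Explanation tree; names here are illustrative:)

        {code:java}
        import java.io.IOException;
        import org.apache.lucene.search.Explanation;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.search.TopDocs;

        // Sketch: collect the page of hits normally, then ask explain() which
        // clauses matched for just those 10-20 docs. The thread's complaint is
        // that the Explanation tree is text-oriented; a structured result is
        // the ask.
        public class SecondPassHighlight {
          public static void inspectTopHits(IndexSearcher searcher, Query query)
              throws IOException {
            TopDocs top = searcher.search(query, 20);
            for (ScoreDoc hit : top.scoreDocs) {
              Explanation exp = searcher.explain(query, hit.doc);
              if (exp.isMatch()) {
                // walk exp.getDetails() recursively to see which clauses
                // contributed to this hit
              }
            }
          }
        }
        {code}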

        Steve Rowe added a comment -

        ... titrating ... There's a word for deciding who gets medical care and who doesn't.

        triage?

        Benson Margulies added a comment -

        I have a potential application for this, and would be willing to work on it, assuming that committers have any interest in committing the results.

        Let me explain my particular case, which some of you may have seen discussed on solr-users.

        Imagine you wanted to search for documents based on some relatively expensive similarity metric. Too expensive, by far, to run on every single document in the index, or even on all the documents that pass some filter first.

        Further imagine that you come up with an approximation of the similarity metric in terms of Lucene query capabilities. The approximation is ordinary (e.g. no Solr Functions forcing a computation on each document), and has the same (or higher) recall as the real metric, but lower precision. Roughly: the top 200 hits based on the approximation will contain the top 10 hits based on the real metric.

        OK, well, then, you can run this query, retrieve documents, select the top hits, and then run the real metric. You get the right answer for far less CPU time.

        And all of this works perfectly fine with Lucene (and Solr) as we know it. However, imagine a further challenge. You want to combine the approximation query with arbitrary other query terms – and then fix up the scores in the top documents to reflect the real metric.

        Well, you can run a second query on just the approximation query to get its score contribution, subtract it out, and add in the results of the real metric (scaling here is a challenge), as sketched below.
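
        (A hypothetical sketch of that fixup; realMetric() is a placeholder for the expensive similarity, and the scaling challenge is glossed over:)

        {code:java}
        import java.io.IOException;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.ScoreDoc;

        // Hypothetical sketch of the score fixup described above.
        public class ScoreFixup {
          public static float fixUp(IndexSearcher searcher, Query approximation,
                                    ScoreDoc hit) throws IOException {
            // what the approximation clause contributed to this hit's score
            float approxPart = searcher.explain(approximation, hit.doc).getValue();
            // subtract the approximation out, add the real metric in
            return hit.score - approxPart + realMetric(hit.doc);
          }

          // placeholder for the expensive similarity computation
          private static float realMetric(int doc) {
            return 0f;
          }
        }
        {code}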

        Or, it seems to me, you could use this approach here, as perhaps extended as discussed.

        ?

        Robert Muir added a comment -

        I think this issue predates the scorer navigation API: LUCENE-2590, LUCENE-3330.

        If you want to know the inner details of subscorers and such, you can
        enumerate the Scorer "tree" inside your Collector.setNextReader, e.g. saving
        references to the subscorers you care about, and then in collect() you can
        do whatever you want via their freq()/score()/ etc methods to find out which
        ones matched...

        This is in trunk today and generally works, though there are some bugs to iron out
        (see LUCENE-3505, for example).
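
        (A hedged sketch of that pattern against the 4.x-era trunk API; SubScorerCollector is an invented name. The scorer tree is handed to setScorer, so the sketch flattens it there:)

        {code:java}
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.lucene.index.AtomicReaderContext;
        import org.apache.lucene.search.Collector;
        import org.apache.lucene.search.Scorer;
        import org.apache.lucene.search.Scorer.ChildScorer;

        // Sketch: flatten the Scorer tree when it is handed over, then in
        // collect() check which subscorers are positioned on the current doc.
        public class SubScorerCollector extends Collector {
          private final List<Scorer> subScorers = new ArrayList<Scorer>();

          @Override
          public void setScorer(Scorer scorer) throws IOException {
            subScorers.clear();   // called per segment; rebuild the flat list
            flatten(scorer);
          }

          private void flatten(Scorer scorer) {
            subScorers.add(scorer);
            for (ChildScorer child : scorer.getChildren()) {
              flatten(child.child);
            }
          }

          @Override
          public void collect(int doc) throws IOException {
            for (Scorer s : subScorers) {
              if (s.docID() == doc) {
                // this subscorer matched the current doc;
                // s.freq() and s.score() are available here
              }
            }
          }

          @Override
          public void setNextReader(AtomicReaderContext context) {}

          @Override
          public boolean acceptsDocsOutOfOrder() {
            // require in-order scoring so subscorers stay positioned
            // on the doc being collected
            return false;
          }
        }
        {code}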

        Benson Margulies added a comment - edited

        Robert, Thanks. How do you all handle JIRAs like this? Would it make sense to close it as obviated by those other things?

        I also have to work out how to exploit this in Solr.


People

    • Assignee: Unassigned
    • Reporter: Mark Harwood
    • Votes: 1
    • Watchers: 2
