Lucene - Core
  1. Lucene - Core
  2. LUCENE-5415

Support wildcard & co in PostingsHighlighter

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      PostingsHighlighter uses the offsets encoded in the postings lists for the terms to find query matches.

      As such, it isn't really suitable for stuff like wildcards for two reasons:
      1. an expensive rewrite against the term dictionary (i think other highlighters share this problem)
      2. accumulating data from potentially many terms (e.g. reading many postings)

      However, we could provide an option for some of these queries to work, but in a different way, that avoids these downsides.

      Instead we can just grab the Automaton representation of the queries, and match it against the content directly (which won't blow up).

      1. LUCENE-5415.patch
        59 kB
        Robert Muir
      2. LUCENE-5415.patch
        42 kB
        Michael McCandless
      3. LUCENE-5415.patch
        38 kB
        Robert Muir
      4. LUCENE-5415.patch
        23 kB
        Robert Muir
      5. LUCENE-5415.patch
        13 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        Here's a prototype. just has one trivial test (needs some more before committing), so the usual warnings apply. But it does not change the default behavior at all, or require any changes to the main loop of the highlighting algorithm.

        Show
        Robert Muir added a comment - Here's a prototype. just has one trivial test (needs some more before committing), so the usual warnings apply. But it does not change the default behavior at all, or require any changes to the main loop of the highlighting algorithm.
        Hide
        Michael McCandless added a comment -

        I love this approach! Assuming the analyzer is not too costly, this ought to scale much better than "rewrite Query up front" workaround when the query matches tons of terms, i.e. it's less susceptible to an adversary.

        The patch is surprisingly simple

        How will the FakeDocsEnum.freq() lie affect the default PassageScorer? Will this bias against passages that had an MTQ match?

        So, all MTQs are squished into a single fake/virtual term for matching, like I cannot tell which of the N MTQs in my query caused the hit. I think this is OK for starters: it's unusual (maybe?) to run multiple MTQs and to also care about which one matched each hit in the highlight. But I guess we could instead add N virtual terms, one per MTQ... later.

        Show
        Michael McCandless added a comment - I love this approach! Assuming the analyzer is not too costly, this ought to scale much better than "rewrite Query up front" workaround when the query matches tons of terms, i.e. it's less susceptible to an adversary. The patch is surprisingly simple How will the FakeDocsEnum.freq() lie affect the default PassageScorer? Will this bias against passages that had an MTQ match? So, all MTQs are squished into a single fake/virtual term for matching, like I cannot tell which of the N MTQs in my query caused the hit. I think this is OK for starters: it's unusual (maybe?) to run multiple MTQs and to also care about which one matched each hit in the highlight. But I guess we could instead add N virtual terms, one per MTQ... later.
        Hide
        Robert Muir added a comment -

        How will the FakeDocsEnum.freq() lie affect the default PassageScorer? Will this bias against passages that had an MTQ match?

        Terribly. Yes. Its a prototype But remember: these are typically constant-score in lucene.

        So, all MTQs are squished into a single fake/virtual term for matching, like I cannot tell which of the N MTQs in my query caused the hit. I think this is OK for starters: it's unusual (maybe?) to run multiple MTQs and to also care about which one matched each hit in the highlight. But I guess we could instead add N virtual terms, one per MTQ... later.

        Same as above: its a prototype. I avoided an automaton UNION operation (scared of perf, and well, multiple MTQs are rarish). But who uses the API to look at the terms? Does telling them which MTQ matched really seem that important? (nothing in the highlighter api uses this today!!!!!!) They have the offsets to know the actual text that matched still, so i mean... I think its ok for now?

        In both cases: I tried to avoid special casing this stuff (sorry, i think if you have a serious search system, then wildcards are rare), and instead add them in a non-disruptive way so that its clear it doesnt break the algorithm, which I think is generally working "ok".

        Show
        Robert Muir added a comment - How will the FakeDocsEnum.freq() lie affect the default PassageScorer? Will this bias against passages that had an MTQ match? Terribly. Yes. Its a prototype But remember: these are typically constant-score in lucene. So, all MTQs are squished into a single fake/virtual term for matching, like I cannot tell which of the N MTQs in my query caused the hit. I think this is OK for starters: it's unusual (maybe?) to run multiple MTQs and to also care about which one matched each hit in the highlight. But I guess we could instead add N virtual terms, one per MTQ... later. Same as above: its a prototype. I avoided an automaton UNION operation (scared of perf, and well, multiple MTQs are rarish). But who uses the API to look at the terms? Does telling them which MTQ matched really seem that important? (nothing in the highlighter api uses this today!!!!!!) They have the offsets to know the actual text that matched still, so i mean... I think its ok for now? In both cases: I tried to avoid special casing this stuff (sorry, i think if you have a serious search system, then wildcards are rare), and instead add them in a non-disruptive way so that its clear it doesnt break the algorithm, which I think is generally working "ok".
        Hide
        Robert Muir added a comment -

        Updated patch, I added some more queries, some more simple tests and refactored a bit to be less intrusive to PH.

        Still working...

        Show
        Robert Muir added a comment - Updated patch, I added some more queries, some more simple tests and refactored a bit to be less intrusive to PH. Still working...
        Hide
        Michael McCandless added a comment -

        Does telling them which MTQ matched really seem that important?

        It seems like it may be useful? Again, I don't think it should block this addition! It can be done later.

        But e.g. with the lucene server (LUCENE-5376) when you search for "fast lucene" and the content was "Lucene's PostingsHighlighter is fast", it returns this JSON today:

        {
         "startOffset":0,
         "endOffset":41,
         "parts":[
          {
           "term":"lucene",
           "text":"Lucene's"
          },
          " new PostingsHighlighter is ",
          {
           "term":"fast",
           "text":"fast"
          },
          "."
         ]
        }
        

        Ie, for each "hit" in the snippet it tells you which term from the original query "caused" that hit. So I thought it would also be useful somehow to know which MTQ caused the hit too ... later!

        Show
        Michael McCandless added a comment - Does telling them which MTQ matched really seem that important? It seems like it may be useful? Again, I don't think it should block this addition! It can be done later. But e.g. with the lucene server ( LUCENE-5376 ) when you search for "fast lucene" and the content was "Lucene's PostingsHighlighter is fast", it returns this JSON today: { "startOffset":0, "endOffset":41, "parts":[ { "term":"lucene", "text":"Lucene's" }, " new PostingsHighlighter is ", { "term":"fast", "text":"fast" }, "." ] } Ie, for each "hit" in the snippet it tells you which term from the original query "caused" that hit. So I thought it would also be useful somehow to know which MTQ caused the hit too ... later!
        Hide
        Robert Muir added a comment -

        Thanks Mike: I didn't realize you were using it. I can give you the string for what caused the hit, in this case I think Query.toString() is the right one (it generally matches whatever the wildcard or whatever was). Will be slightly evil, but its better than having null.

        Show
        Robert Muir added a comment - Thanks Mike: I didn't realize you were using it. I can give you the string for what caused the hit, in this case I think Query.toString() is the right one (it generally matches whatever the wildcard or whatever was). Will be slightly evil, but its better than having null.
        Hide
        Robert Muir added a comment -

        added more tests, and fakedocsenum returns the query that matched via its payload

        Show
        Robert Muir added a comment - added more tests, and fakedocsenum returns the query that matched via its payload
        Hide
        Michael McCandless added a comment -

        New patch just adding a test case to show that we "identify" which query caused each hit ... it passes!

        Show
        Michael McCandless added a comment - New patch just adding a test case to show that we "identify" which query caused each hit ... it passes!
        Hide
        Robert Muir added a comment -

        Thanks for the test Mike!

        I prevent creating this toString() per-match, added support for SpanMTQWrapper, added tests, and added solr support.

        I think this is ready.

        Show
        Robert Muir added a comment - Thanks for the test Mike! I prevent creating this toString() per-match, added support for SpanMTQWrapper, added tests, and added solr support. I think this is ready.
        Hide
        Michael McCandless added a comment -

        +1, patch looks great! Clever to use payload to send the Query back, thank you This way apps can know which query or term caused a given highlight hit.

        Also nice to fix those unrelated crabs in span queries...

        Show
        Michael McCandless added a comment - +1, patch looks great! Clever to use payload to send the Query back, thank you This way apps can know which query or term caused a given highlight hit. Also nice to fix those unrelated crabs in span queries...
        Hide
        ASF subversion and git services added a comment -

        Commit 1561451 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1561451 ]

        LUCENE-5415: add multitermquery support to PostingsHighlighter

        Show
        ASF subversion and git services added a comment - Commit 1561451 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1561451 ] LUCENE-5415 : add multitermquery support to PostingsHighlighter
        Hide
        ASF subversion and git services added a comment -

        Commit 1561456 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1561456 ]

        LUCENE-5415: add multitermquery support to PostingsHighlighter

        Show
        ASF subversion and git services added a comment - Commit 1561456 from Robert Muir in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1561456 ] LUCENE-5415 : add multitermquery support to PostingsHighlighter
        Hide
        Robert Muir added a comment -

        Thanks Mike

        Show
        Robert Muir added a comment - Thanks Mike
        Hide
        ASF subversion and git services added a comment -

        Commit 1563180 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1563180 ]

        LUCENE-5415,LUCENE-5376: get MultiTermQuery highlighting working; fix compilation errors

        Show
        ASF subversion and git services added a comment - Commit 1563180 from Michael McCandless in branch 'dev/branches/lucene5376' [ https://svn.apache.org/r1563180 ] LUCENE-5415 , LUCENE-5376 : get MultiTermQuery highlighting working; fix compilation errors

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development