Solr
  1. Solr
  2. SOLR-6062

Phrase queries are created for each field supplied through edismax's pf, pf2 and pf3 parameters (rather them being combined in a single dismax query)

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.10
    • Component/s: query parsers
    • Labels:
      None

      Description

      SOLR-2058 subtly changed how phrase queries, created through the pf, pf2 and pf3 parameters, are merged into the main user query.

      For the query: 'term1 term2' with pf2:[field1, field2, field3] we now get (omitting the non phrase query section for clarity):

      <main query>
      DisjunctionMaxQuery((field1:"term1 term2"^1.0)~0.1)
      DisjunctionMaxQuery((field2:"term1 term2"^1.0)~0.1)
      DisjunctionMaxQuery((field3:"term1 term2"^1.0)~0.1)
      

      Prior to this change, we had:

      <main query> 
      DisjunctionMaxQuery((field1:"term1 term2"^1.0 | field2:"term1 term2"^1.0 | field3:"term1 term2"^1.0)~0.1)
      

      The upshot being that if the phrase query "term1 term2" appears in multiple fields, it will get a significant boost over the previous implementation.

        Issue Links

          Activity

          Hide
          Michael Dodsworth added a comment -

          As was mentioned on this issue, the behavioral change was not desirable.

          Show
          Michael Dodsworth added a comment - As was mentioned on this issue, the behavioral change was not desirable.
          Hide
          Michael Dodsworth added a comment -

          Rather than sending each FieldParam through addShingledPhraseQueries individually (which results in a dismax query per field), we're now grouping the phrase fields by their wordGram count and sending each group through addShingledPhraseQueries.

          One slight complication is that the original/linked issue allowed the same field to be passed through a pf parameter with differing slop values. The intent being that those scores would be combined, rather than the max being used across those fields. In order to continue support for that feature, we're also grouping FieldParams by their associated slop values (passing each group through independently).

          I've added a test for the multi-field case. If people are happy with the approach, I can combine the wordGram and slop value grouping into a single pass.

          Show
          Michael Dodsworth added a comment - Rather than sending each FieldParam through addShingledPhraseQueries individually (which results in a dismax query per field), we're now grouping the phrase fields by their wordGram count and sending each group through addShingledPhraseQueries. One slight complication is that the original/linked issue allowed the same field to be passed through a pf parameter with differing slop values. The intent being that those scores would be combined, rather than the max being used across those fields. In order to continue support for that feature, we're also grouping FieldParams by their associated slop values (passing each group through independently). I've added a test for the multi-field case. If people are happy with the approach, I can combine the wordGram and slop value grouping into a single pass.
          Hide
          Michael Dodsworth added a comment -
          Show
          Michael Dodsworth added a comment - Naomi Dushay Ron Mayer
          Hide
          Michael Dodsworth added a comment -

          all comments and feedback welcome.

          Show
          Michael Dodsworth added a comment - all comments and feedback welcome.
          Hide
          Michael Dodsworth added a comment -
          Show
          Michael Dodsworth added a comment - adding James Dyer Jan Høydahl , as you were involved in https://issues.apache.org/jira/browse/SOLR-2058
          Hide
          Ron Mayer added a comment -

          Regarding "the original/linked issue allowed the same field to be passed through a pf parameter with differing slop values. The intent being that those scores would be combined, rather than the max being used across those fields". The observation that lead to using the same field with different slop values was that if: either many of the words in searched clauses were in the same paragraph ( a pretty large slop value); or many pairs of words from search clauses were in the same adjectives/noun clauses of the text (quite small small slop value; to make a search for 'old hairy cat' rank well against 'hairy old cat' ) a document was likely to be interesting.

          If I understand right, it sounds to me like what Michael described continue to be good for those cases. I'm traveling this week, but have some test cases comparing ranking of solr-2058 vs human-sorted documents that I can run when I'm back thursday of next week.

          Show
          Ron Mayer added a comment - Regarding "the original/linked issue allowed the same field to be passed through a pf parameter with differing slop values. The intent being that those scores would be combined, rather than the max being used across those fields". The observation that lead to using the same field with different slop values was that if: either many of the words in searched clauses were in the same paragraph ( a pretty large slop value); or many pairs of words from search clauses were in the same adjectives/noun clauses of the text (quite small small slop value; to make a search for 'old hairy cat' rank well against 'hairy old cat' ) a document was likely to be interesting. If I understand right, it sounds to me like what Michael described continue to be good for those cases. I'm traveling this week, but have some test cases comparing ranking of solr-2058 vs human-sorted documents that I can run when I'm back thursday of next week.
          Hide
          Michael Dodsworth added a comment - - edited

          Thanks for looking at this, Ron Mayer,

          Here's an example (post fix) that shows both the grouping within a particular pf? query (where the supplied fields have the same slop) and the splitting out/layering of queries when different slops are used for the same field(s). Hold on to your hats...

          {"q", "aaaa bbbb cccc",
           "qf", "phrase_sw phrase1_sw",
           "pf", "phrase_sw~1^10 phrase_sw~2^20 phrase_sw^30",
           "pf2", "phrase_sw~2^22 phrase_sw^33 phrase1_sw~2^44 phrase1_sw~4^55",
           "pf3", "phrase_sw~2^222 phrase_sw^333 phrase1_sw~2^444 phrase1_sw~4^555"}
          
          # pf -- phrase_sw with 3 different slop values results in 3 independent dismax queries
          DisjunctionMaxQuery((phrase_sw:"aaaa bbbb cccc"~1^10.0)) 
          DisjunctionMaxQuery((phrase_sw:"aaaa bbbb cccc"~2^20.0)) 
          DisjunctionMaxQuery((phrase_sw:"aaaa bbbb cccc"^30.0)) 
          
          # pf2 -- phrase_sw and phrase1_sw were both supplied with a slop of 2, so those queries are grouped
          (
            DisjunctionMaxQuery((phrase_sw:"aaaa bbbb"~2^22.0 | phrase1_sw:"aaaa bbbb"~2^44.0)) 
            DisjunctionMaxQuery((phrase_sw:"bbbb cccc"~2^22.0 | phrase1_sw:"bbbb cccc"~2^44.0))
          ) 
          
          (
            DisjunctionMaxQuery((phrase_sw:"aaaa bbbb"^33.0)) 
            DisjunctionMaxQuery((phrase_sw:"bbbb cccc"^33.0))
          )
          
          (
            DisjunctionMaxQuery((phrase1_sw:"aaaa bbbb"~4^55.0)) 
            DisjunctionMaxQuery((phrase1_sw:"bbbb cccc"~4^55.0))
          )
          
          # pf3
          DisjunctionMaxQuery((phrase_sw:"aaaa bbbb cccc"~2^222.0 | phrase1_sw:"aaaa bbbb cccc"~2^444.0)) 
          DisjunctionMaxQuery((phrase_sw:"aaaa bbbb cccc"^333.0)) 
          DisjunctionMaxQuery((phrase1_sw:"aaaa bbbb cccc"~4^555.0)))
          
          
          Show
          Michael Dodsworth added a comment - - edited Thanks for looking at this, Ron Mayer , Here's an example (post fix) that shows both the grouping within a particular pf? query (where the supplied fields have the same slop) and the splitting out/layering of queries when different slops are used for the same field(s). Hold on to your hats... { "q" , "aaaa bbbb cccc" , "qf" , "phrase_sw phrase1_sw" , "pf" , "phrase_sw~1^10 phrase_sw~2^20 phrase_sw^30" , "pf2" , "phrase_sw~2^22 phrase_sw^33 phrase1_sw~2^44 phrase1_sw~4^55" , "pf3" , "phrase_sw~2^222 phrase_sw^333 phrase1_sw~2^444 phrase1_sw~4^555" } # pf -- phrase_sw with 3 different slop values results in 3 independent dismax queries DisjunctionMaxQuery((phrase_sw: "aaaa bbbb cccc" ~1^10.0)) DisjunctionMaxQuery((phrase_sw: "aaaa bbbb cccc" ~2^20.0)) DisjunctionMaxQuery((phrase_sw: "aaaa bbbb cccc" ^30.0)) # pf2 -- phrase_sw and phrase1_sw were both supplied with a slop of 2, so those queries are grouped ( DisjunctionMaxQuery((phrase_sw: "aaaa bbbb" ~2^22.0 | phrase1_sw: "aaaa bbbb" ~2^44.0)) DisjunctionMaxQuery((phrase_sw: "bbbb cccc" ~2^22.0 | phrase1_sw: "bbbb cccc" ~2^44.0)) ) ( DisjunctionMaxQuery((phrase_sw: "aaaa bbbb" ^33.0)) DisjunctionMaxQuery((phrase_sw: "bbbb cccc" ^33.0)) ) ( DisjunctionMaxQuery((phrase1_sw: "aaaa bbbb" ~4^55.0)) DisjunctionMaxQuery((phrase1_sw: "bbbb cccc" ~4^55.0)) ) # pf3 DisjunctionMaxQuery((phrase_sw: "aaaa bbbb cccc" ~2^222.0 | phrase1_sw: "aaaa bbbb cccc" ~2^444.0)) DisjunctionMaxQuery((phrase_sw: "aaaa bbbb cccc" ^333.0)) DisjunctionMaxQuery((phrase1_sw: "aaaa bbbb cccc" ~4^555.0)))
          Hide
          Michael Dodsworth added a comment -

          Ron Mayer does that output look correct to you?

          Show
          Michael Dodsworth added a comment - Ron Mayer does that output look correct to you?
          Hide
          Michael Dodsworth added a comment -

          Any feedback, Ron Mayer

          Show
          Michael Dodsworth added a comment - Any feedback, Ron Mayer
          Hide
          Michael Dodsworth added a comment -

          James Dyer, any feedback?

          Show
          Michael Dodsworth added a comment - James Dyer , any feedback?
          Hide
          Erik Hatcher added a comment -

          Michael Dodsworth looks reasonable to me, but I think best if we get some others deep into this stuff to weigh in. One typo, looks like, in the patch is "WORK_GRAM_EXTRACTOR", should be WORD.

          Show
          Erik Hatcher added a comment - Michael Dodsworth looks reasonable to me, but I think best if we get some others deep into this stuff to weigh in. One typo, looks like, in the patch is "WORK_GRAM_EXTRACTOR", should be WORD.
          Hide
          Michael Dodsworth added a comment -

          Thanks for looking at this, Erik Hatcher. Any suggestions on folks to pull in?

          Show
          Michael Dodsworth added a comment - Thanks for looking at this, Erik Hatcher . Any suggestions on folks to pull in?
          Hide
          Erik Hatcher added a comment -

          Any suggestions on folks to pull in?

          I guess maybe easier to ask if there are any objections or cons to making this change by anyone? Seems like there's a bit of agreement that SOLR-2058 made things worse and this is better. Any negatives?

          Show
          Erik Hatcher added a comment - Any suggestions on folks to pull in? I guess maybe easier to ask if there are any objections or cons to making this change by anyone? Seems like there's a bit of agreement that SOLR-2058 made things worse and this is better. Any negatives?
          Hide
          Michael Dodsworth added a comment -

          not that I know of – the wanted behavior of SOLR-2058 is supported (by supplying different slop values for the same field) as well as the original behavior.

          Show
          Michael Dodsworth added a comment - not that I know of – the wanted behavior of SOLR-2058 is supported (by supplying different slop values for the same field) as well as the original behavior.
          Hide
          ASF subversion and git services added a comment -

          Commit 1617719 from Erik Hatcher in branch 'dev/trunk'
          [ https://svn.apache.org/r1617719 ]

          SOLR-6062: Fix undesirable edismax query parser effect (introduced in SOLR-2058) in how phrase queries

          Show
          ASF subversion and git services added a comment - Commit 1617719 from Erik Hatcher in branch 'dev/trunk' [ https://svn.apache.org/r1617719 ] SOLR-6062 : Fix undesirable edismax query parser effect (introduced in SOLR-2058 ) in how phrase queries
          Hide
          ASF subversion and git services added a comment -

          Commit 1617721 from Erik Hatcher in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1617721 ]

          SOLR-6062: Fix undesirable edismax query parser effect (introduced in SOLR-2058) in how phrase queries generated from pf, pf2, and pf3 are merged into the main query. (merged from trunk r1617719)

          Show
          ASF subversion and git services added a comment - Commit 1617721 from Erik Hatcher in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1617721 ] SOLR-6062 : Fix undesirable edismax query parser effect (introduced in SOLR-2058 ) in how phrase queries generated from pf, pf2, and pf3 are merged into the main query. (merged from trunk r1617719)
          Hide
          Erik Hatcher added a comment -

          oops, my commit message on trunk (r1617719) got truncated.

          Show
          Erik Hatcher added a comment - oops, my commit message on trunk (r1617719) got truncated.
          Hide
          Erik Hatcher added a comment -

          Michael Dodsworth - I'm with you on this, and decided to take your work as-is (with s/WORK/WORD) and commit it to both trunk and 4x. Thanks for your effort and patience!

          Show
          Erik Hatcher added a comment - Michael Dodsworth - I'm with you on this, and decided to take your work as-is (with s/WORK/WORD) and commit it to both trunk and 4x. Thanks for your effort and patience!
          Hide
          Michael Dodsworth added a comment -

          Fantastic. Thank you, Erik Hatcher.

          Show
          Michael Dodsworth added a comment - Fantastic. Thank you, Erik Hatcher .

            People

            • Assignee:
              Erik Hatcher
              Reporter:
              Michael Dodsworth
            • Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development