Solr
  1. Solr
  2. SOLR-2058

Adds optional "phrase slop" to edismax "pf2", "pf3" and "pf" parameters with field~slop^boost syntax

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: query parsers
    • Labels:
      None
    • Environment:

      n/a

      Description

      http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3C4C659119.2010007@0ape.com%3E

      From Ron Mayer <r...@0ape.com>
      ... my results might be even better if I had a couple different "pf2"s with different "ps"'s at the same time. In particular. One with ps=0 to put a high boost on ones the have the right ordering of words. For example insuring that [the query]:
      "red hat black jacket"
      boosts only documents with "red hats" and not "black hats". And another pf2 with a more modest boost with ps=5 or so to handle the query above also boosting docs with
      "red baseball hat".

      http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3CAANLkTimd+V3g6d_MnHP+JYkKd+dej8FVMvF_1LQoiLBU@mail.gmail.com%3E

      From Yonik Seeley <yo...@lucidimagination.com>
      Perhaps fold it into the pf/pf2 syntax?

      pf=text^2 // current syntax... makes phrases with a boost of 2
      pf=text~1^2 // proposed syntax... makes phrases with a slop of 1 and
      a boost of 2

      That actually seems pretty natural given the lucene query syntax - an
      actual boosted sloppy phrase query already looks like
      text:"foo bar"~1^2

      -Yonik

      http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3Calpine.DEB.1.10.1008161300510.6335@radix.cryptio.net%3E

      From Chris Hostetter <hossman_luc...@fucit.org>

      Big +1 to this idea ... the existing "ps" param can stick arround as the
      default for any field that doesn't specify it's own slop in the pf/pf2/pf3
      fields using the "~" syntax.

      -Hoss

      1. edismax_pf_with_slop_v2.1.patch
        7 kB
        Ron Mayer
      2. edismax_pf_with_slop_v2.patch
        8 kB
        Ron Mayer
      3. pf2_with_slop.patch
        6 kB
        Ron Mayer
      4. SOLR-2058.patch
        10 kB
        James Dyer
      5. SOLR-2058-and-3351-not-finished.patch
        82 kB
        Jan Høydahl

        Issue Links

          Activity

          Hide
          Ron Mayer added a comment - - edited

          This patch is my first draft at implementing this feature.

          Any feedback would be appreciated.

          It seems to happily turn a query like
          http://localhost:8983/solr/select?defType=edismax&fl=id,text,score&q=enterprise+search+foobar&ps=5&qf=text&debugQuery=true&pf2=name~0^5555&pf2=name^12+name~10

          into what I believe is the desired parsed query:

          +((text:enterpris) (text:search) (text:foobar)) ((name:"enterprise search"~5^12.0) (name:"search foobar"~5^12.0)) ((name:"enterprise search"^5555.0) (name:"search foobar"^5555.0)) ((name:"enterprise search"~10) (name:"search foobar"~10))

          which looks like it should give a high boost to docs where both words appear right next to each other, but still substantial boosts to docs where the pairs of words are a few words apart.

          Show
          Ron Mayer added a comment - - edited This patch is my first draft at implementing this feature. Any feedback would be appreciated. It seems to happily turn a query like http://localhost:8983/solr/select?defType=edismax&fl=id,text,score&q=enterprise+search+foobar&ps=5&qf=text&debugQuery=true&pf2=name~0^5555&pf2=name^12+name~10 into what I believe is the desired parsed query: +((text:enterpris) (text:search) (text:foobar)) ((name:"enterprise search"~5^12.0) (name:"search foobar"~5^12.0)) ((name:"enterprise search"^5555.0) (name:"search foobar"^5555.0)) ((name:"enterprise search"~10) (name:"search foobar"~10)) which looks like it should give a high boost to docs where both words appear right next to each other, but still substantial boosts to docs where the pairs of words are a few words apart.
          Hide
          Yonik Seeley added a comment -

          Map<Integer,Map<String,Float>> is an "interesting" way to encode the slop and the boost
          But perhaps we should make a FieldParams class?

          We could keep the separate pf,pf2,pf3 maps... or encode the number of terms to make phrases out of in the FieldParams class:

          class FIeldParams {
            int wordGrams;  // make bigrams if 2, trigrams if 3, or all if MAX_INT
            int slop;
            float boost;
          }
          
          Show
          Yonik Seeley added a comment - Map<Integer,Map<String,Float>> is an "interesting" way to encode the slop and the boost But perhaps we should make a FieldParams class? We could keep the separate pf,pf2,pf3 maps... or encode the number of terms to make phrases out of in the FieldParams class: class FIeldParams { int wordGrams; // make bigrams if 2, trigrams if 3, or all if MAX_INT int slop; float boost; }
          Hide
          Ron Mayer added a comment -

          Totally agree that that was a bizarre way I used of encoding the boost.

          I did that on my first draft just to minimize the impact with the rest of the code (where some functions were expecting the "Map<String,Float>" pieces).

          I'll post an updated patch with a more sane class like the one you described. I'm new enough to the code that I'm not sure where such a class should reside. Any opinions?

          Show
          Ron Mayer added a comment - Totally agree that that was a bizarre way I used of encoding the boost. I did that on my first draft just to minimize the impact with the rest of the code (where some functions were expecting the "Map<String,Float>" pieces). I'll post an updated patch with a more sane class like the one you described. I'm new enough to the code that I'm not sure where such a class should reside. Any opinions?
          Hide
          Ron Mayer added a comment -

          Also wanted to note - I've been using this on a QA machine with 4 million documents, and it has been working extremely well for me; with multiple simultaneous phrase slop.

          In particular, if I use:

          • a high boost (500) on pf with slop of 0
          • a moderate boost (50) on pf with a slop of 50
          • a moderate boost (50) on pf2 with a slop of 0
          • a low boost (10) on pf2 with a slop of 10

          it's doing a great job of getting the most relevant document in the #1 spot, and a very good job at getting the entire first page of results filled with highly relevant documents.

          Show
          Ron Mayer added a comment - Also wanted to note - I've been using this on a QA machine with 4 million documents, and it has been working extremely well for me; with multiple simultaneous phrase slop. In particular, if I use: a high boost (500) on pf with slop of 0 a moderate boost (50) on pf with a slop of 50 a moderate boost (50) on pf2 with a slop of 0 a low boost (10) on pf2 with a slop of 10 it's doing a great job of getting the most relevant document in the #1 spot, and a very good job at getting the entire first page of results filled with highly relevant documents.
          Hide
          Ron Mayer added a comment - - edited

          Submitted an updated patch to use a more sane FieldParams class to pass fields,, boosts, and phrase slops instead of the bizarre Map<Integer,Map<String,Float>> I was using before.

          Show
          Ron Mayer added a comment - - edited Submitted an updated patch to use a more sane FieldParams class to pass fields,, boosts, and phrase slops instead of the bizarre Map<Integer,Map<String,Float>> I was using before.
          Hide
          Ron Mayer added a comment -

          Removed a couple unnecessary lines compared to the last version

          Show
          Ron Mayer added a comment - Removed a couple unnecessary lines compared to the last version
          Hide
          Ron Mayer added a comment -

          This one adds to the edismax syntax allowing different phrase slop for each pf, pf2, and pf3 expression; and makes it practical to have 2 different pf parameters with different slops.

          Show
          Ron Mayer added a comment - This one adds to the edismax syntax allowing different phrase slop for each pf, pf2, and pf3 expression; and makes it practical to have 2 different pf parameters with different slops.
          Hide
          Nick Hall added a comment -

          Thanks for creating this patch. I tried it out and would love to see this put into the base code. Being able to adjust the slop for pf/pf2/pf3 allows for a dramatic increase in search result relevance.

          Show
          Nick Hall added a comment - Thanks for creating this patch. I tried it out and would love to see this put into the base code. Being able to adjust the slop for pf/pf2/pf3 allows for a dramatic increase in search result relevance.
          Hide
          Ron Mayer added a comment - - edited

          Nick, thanks! Glad you like it.

          I'm keeping a version that's kept more up-to-date with trunk on github here:
          https://github.com/ramayer/lucene-solr/tree/solr_2058_edismax_pf2_phrase_slop

          (though I must admit I've tested that less than an internal fork of trunk we made in Sep 2010 and deployed with only a few additional cherry-picked patches)

          Show
          Ron Mayer added a comment - - edited Nick, thanks! Glad you like it. I'm keeping a version that's kept more up-to-date with trunk on github here: https://github.com/ramayer/lucene-solr/tree/solr_2058_edismax_pf2_phrase_slop (though I must admit I've tested that less than an internal fork of trunk we made in Sep 2010 and deployed with only a few additional cherry-picked patches)
          Hide
          James Dyer added a comment -

          Does anyone know offhand which issues still need to be resolved on this so that it can be committed?

          Show
          James Dyer added a comment - Does anyone know offhand which issues still need to be resolved on this so that it can be committed?
          Hide
          Jack Krupansky added a comment -

          Just a question: Is it 100% compatible with existing edismax usage, so that ps, ps2, and ps3 are default slops and that the individual field slops are optional in pf, pf2, and pf3 and if not explicitly present will default as they do today from ps, ps2, and ps3? Even so, it would be nice to specify the default slop up front on pf, pf2, and pf3 to apply to all fields without having to revert to using ps, ps2, and ps3.

          Another simple question: can slop and boost be in either order, or must slop be first, before boost?

          Show
          Jack Krupansky added a comment - Just a question: Is it 100% compatible with existing edismax usage, so that ps, ps2, and ps3 are default slops and that the individual field slops are optional in pf, pf2, and pf3 and if not explicitly present will default as they do today from ps, ps2, and ps3? Even so, it would be nice to specify the default slop up front on pf, pf2, and pf3 to apply to all fields without having to revert to using ps, ps2, and ps3. Another simple question: can slop and boost be in either order, or must slop be first, before boost?
          Hide
          James Dyer added a comment -

          Here is an updated patch based on Ron's August 31, 2010 version. I've cleaned up the code a little and added a unit test scenario. To recap what this does:

          • Fully backwards compatible with the existing pf/pf2/pf3/ps syntax.
          • Allows an optional slop parameter in the syntax "FieldName~slop^boost"
          • "ps" value is the default if the slop is not specified per-field

          This will give users the flexibility to say something like "if the words are kinda near each other, boost a little but if they are really near each other, boost a lot, etc"

          Unless someone objects, I will re-assign this issue to myself and commit early next week.

          Show
          James Dyer added a comment - Here is an updated patch based on Ron's August 31, 2010 version. I've cleaned up the code a little and added a unit test scenario. To recap what this does: Fully backwards compatible with the existing pf/pf2/pf3/ps syntax. Allows an optional slop parameter in the syntax "FieldName~slop^boost" "ps" value is the default if the slop is not specified per-field This will give users the flexibility to say something like "if the words are kinda near each other, boost a little but if they are really near each other, boost a lot, etc" Unless someone objects, I will re-assign this issue to myself and commit early next week.
          Hide
          Jack Krupansky added a comment -

          It would be nice to have a user-friendly error if the user makes a mistake like a space after ^ or ~ or writing field^2.0~3

          Show
          Jack Krupansky added a comment - It would be nice to have a user-friendly error if the user makes a mistake like a space after ^ or ~ or writing field^2.0~3
          Hide
          Jan Høydahl added a comment -

          James, you may grab it. I have a half-baked patch, but will rather take a stab later if there is anything to improve after your commit.

          Show
          Jan Høydahl added a comment - James, you may grab it. I have a half-baked patch, but will rather take a stab later if there is anything to improve after your commit.
          Hide
          James Dyer added a comment - - edited

          Jan, I'm curious: is the patch you're working on more of a "clean-up" of Jack's patch, or does it do something Jack's patch didn't do? I've been testing this and so far while it issues the dismax queries I'd expect, it doesn't help relevance like I though it would. so I'm a little less confident about it. I also was wondering if it should be generalized to allow "pf[\d]" so users can have pf4, pf5 etc if they wanted. This extra functionality wouldn't be hard to add but would it make sense?

          In any case, I would like to have something to commit here this week if possible. Your insight is very much appreciated.

          Show
          James Dyer added a comment - - edited Jan, I'm curious: is the patch you're working on more of a "clean-up" of Jack's patch, or does it do something Jack's patch didn't do? I've been testing this and so far while it issues the dismax queries I'd expect, it doesn't help relevance like I though it would. so I'm a little less confident about it. I also was wondering if it should be generalized to allow "pf [\d] " so users can have pf4, pf5 etc if they wanted. This extra functionality wouldn't be hard to add but would it make sense? In any case, I would like to have something to commit here this week if possible. Your insight is very much appreciated.
          Hide
          Jan Høydahl added a comment -

          I did a lot of things in the same patch - including ps2, ps3 support and refactoring of FieldSpec parsing to a separate class, and adding test cases for boosting. But there is wrapping up to do and I don't know if I'm 100% happy with using RegEx for parsing fieldspec. I'll attach what I have, but as I say I think it is better to add some of these improvements incrementally.

          Show
          Jan Høydahl added a comment - I did a lot of things in the same patch - including ps2, ps3 support and refactoring of FieldSpec parsing to a separate class, and adding test cases for boosting. But there is wrapping up to do and I don't know if I'm 100% happy with using RegEx for parsing fieldspec. I'll attach what I have, but as I say I think it is better to add some of these improvements incrementally.
          Hide
          Jan Høydahl added a comment -

          Attaching a patch doing also some ps2/ps3 stuff, adding more tests etc, but it's not finished. Unfortunately it's big partly due to whitespace differences

          Show
          Jan Høydahl added a comment - Attaching a patch doing also some ps2/ps3 stuff, adding more tests etc, but it's not finished. Unfortunately it's big partly due to whitespace differences
          Hide
          James Dyer added a comment -

          Committed to Trunk, r1342681.

          This is the May 17, 2012 patch which is a touched-up version of Ron Mayer's work from August 31, 2010 (Thank you!).

          Show
          James Dyer added a comment - Committed to Trunk, r1342681. This is the May 17, 2012 patch which is a touched-up version of Ron Mayer's work from August 31, 2010 (Thank you!).
          Hide
          Michael Dodsworth added a comment -

          It looks like this change also altered the way phrase queries are merged into the main query.

          For the query: 'term1 term2' with pf2:[field1, field2, field3] we now get (omitting the non phrase query section for clarity):

            <main query> DisjunctionMaxQuery((field1:"term1 term2"^1.0)~0.1)
          DisjunctionMaxQuery((field2:"term1 term2"^1.0)~0.1)
          DisjunctionMaxQuery((field3:"term1 term2"^1.0)~0.1)
          

          Prior to this change, we got:

            <main query> DisjunctionMaxQuery((field1:"term1 term2"^1.0 | field2:"term1 term2"^1.0 | field3:"term1 term2"^1.0)~0.1)
          

          The upshot being that if the phrase query "term1 term2" appears in multiple fields, it will get a significant boost over the previous implementation. The presence of the dismax queries makes me think this behavioral change was not intentional; if that's the case, let me know and I'll get a fix together.
          Thanks.

          Show
          Michael Dodsworth added a comment - It looks like this change also altered the way phrase queries are merged into the main query. For the query: 'term1 term2' with pf2: [field1, field2, field3] we now get (omitting the non phrase query section for clarity): <main query> DisjunctionMaxQuery((field1: "term1 term2" ^1.0)~0.1) DisjunctionMaxQuery((field2: "term1 term2" ^1.0)~0.1) DisjunctionMaxQuery((field3: "term1 term2" ^1.0)~0.1) Prior to this change, we got: <main query> DisjunctionMaxQuery((field1: "term1 term2" ^1.0 | field2: "term1 term2" ^1.0 | field3: "term1 term2" ^1.0)~0.1) The upshot being that if the phrase query "term1 term2" appears in multiple fields, it will get a significant boost over the previous implementation. The presence of the dismax queries makes me think this behavioral change was not intentional; if that's the case, let me know and I'll get a fix together. Thanks.
          Hide
          Michael Dodsworth added a comment -

          Can anyone comment on whether this change was intentional or accidental (and unwanted)?

          Show
          Michael Dodsworth added a comment - Can anyone comment on whether this change was intentional or accidental (and unwanted)?
          Hide
          Ron Mayer added a comment -

          I just tried them both (the committed one, and my original patch); and at least they both produce much better relevancy on my test data than I was able to get without the patch.

          However I agree with you that it looks to me like the change was probably unintentional and seems different from the way I think normal dismax queries work.

          TL/DR: I'm not sure. Anyone else care to either test&compare them or just look at the code and see which is more reasonable?

          Show
          Ron Mayer added a comment - I just tried them both (the committed one, and my original patch); and at least they both produce much better relevancy on my test data than I was able to get without the patch. However I agree with you that it looks to me like the change was probably unintentional and seems different from the way I think normal dismax queries work. TL/DR: I'm not sure. Anyone else care to either test&compare them or just look at the code and see which is more reasonable?
          Hide
          Naomi Dushay added a comment -

          Michael - per your comment on Sep 25, 2012 – that behavioral change is not desirable, in my opinion.

          Show
          Naomi Dushay added a comment - Michael - per your comment on Sep 25, 2012 – that behavioral change is not desirable, in my opinion.

            People

            • Assignee:
              James Dyer
              Reporter:
              Ron Mayer
            • Votes:
              4 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development