Solr
  1. Solr
  2. SOLR-3026

eDismax: Locking down which fields can be explicitly queried (user fields aka uf)

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 3.2, 3.3, 3.4, 3.5
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: search
    • Labels:
      None

      Description

      We need a way to specify exactly what fields should be available to the end user as fielded search.

      In the original SOLR-1553, there's a patch implementing "user fields", but it was never committed even if that issue was closed.

      1. SOLR-3026-3x.patch
        28 kB
        Jan Høydahl
      2. SOLR-3026.patch
        26 kB
        Jan Høydahl
      3. SOLR-3026.patch
        26 kB
        Tomás Fernández Löbbe
      4. SOLR-3026.patch
        23 kB
        Tomás Fernández Löbbe
      5. SOLR-3026.patch
        32 kB
        Tomás Fernández Löbbe
      6. SOLR-3026.patch
        17 kB
        Jan Høydahl
      7. SOLR-3026.patch
        16 kB
        Jan Høydahl
      8. SOLR-3026.patch
        11 kB
        Jan Høydahl
      9. SOLR-3026.patch
        8 kB
        Jan Høydahl

        Issue Links

          Activity

          Hide
          Jan Høydahl added a comment -

          I've worked with HossMan patch from SOLR-1553 and made it work with trunk. It's still a bit hacky I think, as I don't fully understand all the parsing stuff in there, but it works Here's how to use it:

          q=id:123               => HIT, default for empty "uf" is allowing all fields, as today
          q=id:123&uf=*          => HIT, explicitly allowing all fields
          q=id:123&uf=id         => HIT, explicitly allowing only field "id"
          q=id:123&uf=id^2       => HIT, using the boost from uf
          q=id:123&uf=*^3 -name  => HIT, allowing all fields except "name", using boost from "*"
          q=id:123&uf=-*         => NO HIT, disallowing all fields
          q=id:123&uf=* -id      => NO HIT, allowing all fields except "id"
          

          What happens when you query disallowed fields is that the ":" gets escaped so that the whole clause is searched as a literal.

          It feels a bit dirty to modify the "clause.raw" string in order to escape colon, and to add "^" boost etc. Don't know if this has other side effects. But it's a start. Please test it and comment on better ways of doing things.

          Show
          Jan Høydahl added a comment - I've worked with HossMan patch from SOLR-1553 and made it work with trunk. It's still a bit hacky I think, as I don't fully understand all the parsing stuff in there, but it works Here's how to use it: q=id:123 => HIT, default for empty "uf" is allowing all fields, as today q=id:123&uf=* => HIT, explicitly allowing all fields q=id:123&uf=id => HIT, explicitly allowing only field "id" q=id:123&uf=id^2 => HIT, using the boost from uf q=id:123&uf=*^3 -name => HIT, allowing all fields except "name", using boost from "*" q=id:123&uf=-* => NO HIT, disallowing all fields q=id:123&uf=* -id => NO HIT, allowing all fields except "id" What happens when you query disallowed fields is that the ":" gets escaped so that the whole clause is searched as a literal. It feels a bit dirty to modify the "clause.raw" string in order to escape colon, and to add "^" boost etc. Don't know if this has other side effects. But it's a start. Please test it and comment on better ways of doing things.
          Hide
          Yonik Seeley added a comment -

          One think I constantly get wrong whenever I try to query more than one field w/ dismax is the space separated lists... I naturally use commas.
          If it's not already there, should we allow comma separated lists for fieldnames as we do in other places?

          Show
          Yonik Seeley added a comment - One think I constantly get wrong whenever I try to query more than one field w/ dismax is the space separated lists... I naturally use commas. If it's not already there, should we allow comma separated lists for fieldnames as we do in other places?
          Hide
          Jan Høydahl added a comment -

          @yonik: It's a one-byte change to the split regex and it would work for qf,pf,uf... I'm not against it, but since it's not directly related to this issue, I propose a new JIRA for that change.

          Show
          Jan Høydahl added a comment - @yonik: It's a one-byte change to the split regex and it would work for qf,pf,uf... I'm not against it, but since it's not directly related to this issue, I propose a new JIRA for that change.
          Hide
          Jan Høydahl added a comment -

          @Hoss, this is based on your work, is it inline with what you envisioned back then? I chose to let the default match the current default, so you explicitly have to specify uf to lock things down.

          One feature I'd also like is field name aliasing, and I think it might fit well as an extension of uf. Imagine our schema has a field "searchable_title_t" which we want our Norwegian users to be able to query as "tittel". We could then specify a mapping, say

          &uf=tittel:searchable_title_t
          

          Could the "aliasing" feature of dismax be used for this or should we code it from scratch?

          Show
          Jan Høydahl added a comment - @Hoss, this is based on your work, is it inline with what you envisioned back then? I chose to let the default match the current default, so you explicitly have to specify uf to lock things down. One feature I'd also like is field name aliasing, and I think it might fit well as an extension of uf. Imagine our schema has a field "searchable_title_t" which we want our Norwegian users to be able to query as "tittel". We could then specify a mapping, say &uf=tittel:searchable_title_t Could the "aliasing" feature of dismax be used for this or should we code it from scratch?
          Hide
          Hoss Man added a comment -

          It's been so long, i don't really remember what i envisioned.

          I haven't had a chance to review the patches, but your description of the usecases looks great – my personal preference would be for an empty uf to default to not allowing any explicit fields, but i know i'm in the minority on that opinion, and your "-*" exclusion syntax makes it so easy to do i have absolutely no complaints.

          as for field name aliasing / virtual fields (ie: SOLR-3045) ... as i remember it, the underling "Alias" feature of the DisjunctionMaxQueryParser (i think that's what it's called) should work well for that – assuming the edismax usage of that underlying QueryParser doesn't circumvent it too much.

          As far as user syntax goes, i would suggest that the "per-field override" param syntax on the "qf" param would probably make the most sense here instead of using colons (and wouldn't require the special comma syntax you suggest in SOLR-3045 to specify multiple fields, which would prevent the general change yonik seems to want)

          ie...

             q=elephant title:dumbo who:george
            &qf=title^3 firstname lastname^2 description^2 catchall
            &uf=title^5 who^2 *
            &f.who.qf=firstname lastname^10
          

          ...would cause "elephant" to be searched in all the "qf" fields with the specified boosts; "dumbo" to be searched only against the title field (with a boost of 5 since the user asked for that field explicitly); and "george" will get a DisjunctionMaxQuery with a boost of 2, containing two clauses: firstname (default boost of 1) and lastname (boost of 10).

          Basically: when parsing the "uf" look for a "f.$

          {uf}

          .qf" param, and if it exists parse it and add the appropriate Alias. (fingers crossed it will be that easy: if it isn't, it's probably a feature!)

          Show
          Hoss Man added a comment - It's been so long, i don't really remember what i envisioned. I haven't had a chance to review the patches, but your description of the usecases looks great – my personal preference would be for an empty uf to default to not allowing any explicit fields, but i know i'm in the minority on that opinion, and your "-*" exclusion syntax makes it so easy to do i have absolutely no complaints. as for field name aliasing / virtual fields (ie: SOLR-3045 ) ... as i remember it, the underling "Alias" feature of the DisjunctionMaxQueryParser (i think that's what it's called) should work well for that – assuming the edismax usage of that underlying QueryParser doesn't circumvent it too much. As far as user syntax goes, i would suggest that the "per-field override" param syntax on the "qf" param would probably make the most sense here instead of using colons (and wouldn't require the special comma syntax you suggest in SOLR-3045 to specify multiple fields, which would prevent the general change yonik seems to want) ie... q=elephant title:dumbo who:george &qf=title^3 firstname lastname^2 description^2 catchall &uf=title^5 who^2 * &f.who.qf=firstname lastname^10 ...would cause "elephant" to be searched in all the "qf" fields with the specified boosts; "dumbo" to be searched only against the title field (with a boost of 5 since the user asked for that field explicitly); and "george" will get a DisjunctionMaxQuery with a boost of 2, containing two clauses: firstname (default boost of 1) and lastname (boost of 10). Basically: when parsing the "uf" look for a "f.$ {uf} .qf" param, and if it exists parse it and add the appropriate Alias. (fingers crossed it will be that easy: if it isn't, it's probably a feature!)
          Hide
          Jan Høydahl added a comment -

          I like the f.who.qf style. And the fact that you then can boost the whole DMQ clause as a whole.. I'll add that to SOLR-3045 as a suggestion.

          But it's a bit overkill to spin a DMQ for simple single-field-aliasing, i.e. my example &uf=title:searchable_title_t.
          Ideally such a simple field name aliasing should be supported on the Lucene parser level.
          Alternatively it could be another per-field param

          &f.title.fmap=searchable_title_t
          

          I'm still not sure how to use the built-in aliasing to implement this

          Show
          Jan Høydahl added a comment - I like the f.who.qf style. And the fact that you then can boost the whole DMQ clause as a whole.. I'll add that to SOLR-3045 as a suggestion. But it's a bit overkill to spin a DMQ for simple single-field-aliasing, i.e. my example &uf=title:searchable_title_t. Ideally such a simple field name aliasing should be supported on the Lucene parser level. Alternatively it could be another per-field param &f.title.fmap=searchable_title_t I'm still not sure how to use the built-in aliasing to implement this
          Hide
          Jan Høydahl added a comment -

          New patch implementing the field specific override "f.who.qf=" syntax for aliasing.

          It adds the aliases to the top level Edismax queryparser, so you can do stuff like

          &q=who:george&f.who.qf=firstname^2 middlename lastname^3
          &q=title:(breaking news)&f.title.qf=indexed_title_t
          

          What does NOT yet work is aliasing the fields referenced in qf param, so that you could use &qf=who where what. Have not found the correct hook for that. Anyone?

          Show
          Jan Høydahl added a comment - New patch implementing the field specific override "f.who.qf=" syntax for aliasing. It adds the aliases to the top level Edismax queryparser, so you can do stuff like &q=who:george&f.who.qf=firstname^2 middlename lastname^3 &q=title:(breaking news)&f.title.qf=indexed_title_t What does NOT yet work is aliasing the fields referenced in qf param, so that you could use &qf=who where what. Have not found the correct hook for that. Anyone?
          Hide
          Jan Høydahl added a comment -

          I think this is a good enough first step for user fields and aliasing feature. Then if we want to take it further such as recursive aliasing from inside qf params, we'll open new issues.

          Not sure I got full test coverage for all combinations of fields, default boosts etc.

          Anyone wants to review it? There's bound to be some bugs with such string manipulations...

          Show
          Jan Høydahl added a comment - I think this is a good enough first step for user fields and aliasing feature. Then if we want to take it further such as recursive aliasing from inside qf params, we'll open new issues. Not sure I got full test coverage for all combinations of fields, default boosts etc. Anyone wants to review it? There's bound to be some bugs with such string manipulations...
          Hide
          Jan Høydahl added a comment -

          One more thing we should probably fix. For schemas relying heavily on <dynamicField>s, it could be handy to allow/deny wildcard field names. Imagine:

           <dynamicField name="*name" type="string" />
           <dynamicField name="secret*" type="string" />
          
          With today's patch you'd have to explicitly allow and disallow full field names:
           &uf=firstname middlename lastname companyname -secrettext -secretsalary -secretfoo ....
          
          Better would be:
           &uf=*name -secret*
          
          Show
          Jan Høydahl added a comment - One more thing we should probably fix. For schemas relying heavily on <dynamicField>s, it could be handy to allow/deny wildcard field names. Imagine: <dynamicField name="*name" type="string" /> <dynamicField name="secret*" type="string" /> With today's patch you'd have to explicitly allow and disallow full field names: &uf=firstname middlename lastname companyname -secrettext -secretsalary -secretfoo .... Better would be: &uf=*name -secret*
          Hide
          Jan Høydahl added a comment -

          New patch adding "dynamic field" wildcard support to "uf". New test cases for these, and tests for verifying correct boosts applied from user fields.

          Show
          Jan Høydahl added a comment - New patch adding "dynamic field" wildcard support to "uf". New test cases for these, and tests for verifying correct boosts applied from user fields.
          Hide
          Jan Høydahl added a comment -

          Refactored the UserFields functionality into a separate inner class, for better readability and object orientation

          Show
          Jan Høydahl added a comment - Refactored the UserFields functionality into a separate inner class, for better readability and object orientation
          Hide
          Tomás Fernández Löbbe added a comment -

          Jan, I have a patch that fixes an issue with this implementation and adds some more test cases. It also addresses SOLR-3045, should I add it here or should I try to separate both patches? What I did for SOLR-3045 strongly depends in your code for this issue.
          The issue is for a case like:
          myalias:(Zapp Obnoxious)
          This query is parsed as
          myalias:Zapp default_field:Obnoxious
          instead of
          myalias:Zapp myalias:Obnoxious

          the specific tests I added are:

          assertQ(req("defType","edismax", "uf", "myalias", "q","myalias:(Zapp Obnoxious)", "f.myalias.qf","name^2.0 mytrait_ss^5.0", "mm", "50%"), oner);

          assertQ(req("defType","edismax", "uf","who", "q","who:(Zapp Obnoxious)", "f.who.qf", "name^2.0 trait_ss^5.0", "qf", "id"), twor);

          Show
          Tomás Fernández Löbbe added a comment - Jan, I have a patch that fixes an issue with this implementation and adds some more test cases. It also addresses SOLR-3045 , should I add it here or should I try to separate both patches? What I did for SOLR-3045 strongly depends in your code for this issue. The issue is for a case like: myalias:(Zapp Obnoxious) This query is parsed as myalias:Zapp default_field:Obnoxious instead of myalias:Zapp myalias:Obnoxious the specific tests I added are: assertQ(req("defType","edismax", "uf", "myalias", "q","myalias:(Zapp Obnoxious)", "f.myalias.qf","name^2.0 mytrait_ss^5.0", "mm", "50%"), oner); assertQ(req("defType","edismax", "uf","who", "q","who:(Zapp Obnoxious)", "f.who.qf", "name^2.0 trait_ss^5.0", "qf", "id"), twor);
          Hide
          Jan Høydahl added a comment -

          Hi Tomás. Thanks for the involvement!

          I have a feeling that we'll close SOLR-3045 and do everything in this issue.

          Please upload your improvements here, with same patch name, and we'll continue from there.

          Show
          Jan Høydahl added a comment - Hi Tomás. Thanks for the involvement! I have a feeling that we'll close SOLR-3045 and do everything in this issue. Please upload your improvements here, with same patch name, and we'll continue from there.
          Hide
          Tomás Fernández Löbbe added a comment -

          Still TBD: how to deal with infinite loop of aliasing. Right now this leads to stack overflow error.

          Show
          Tomás Fernández Löbbe added a comment - Still TBD: how to deal with infinite loop of aliasing. Right now this leads to stack overflow error.
          Hide
          Jan Høydahl added a comment -

          Hi, could you re-upload your patch without unrelated changes? Your patch includes severalwhite space/indenting/reformatting changes which is unrelated. This makes it hard to read the patch and see what's new.

          See http://wiki.apache.org/solr/HowToContribute#Creating_the_patch_file under "Please do not".

          How do you trigger an infinite loop?

          Show
          Jan Høydahl added a comment - Hi, could you re-upload your patch without unrelated changes? Your patch includes severalwhite space/indenting/reformatting changes which is unrelated. This makes it hard to read the patch and see what's new. See http://wiki.apache.org/solr/HowToContribute#Creating_the_patch_file under "Please do not". How do you trigger an infinite loop?
          Hide
          Tomás Fernández Löbbe added a comment -

          Sorry about that. This is the patch with the indentation changes.
          I also removed some lines that are now useless with the changes introduced to the parser.

          With infinite loops I mean cyclic aliasing. Something like:
          ...&qf=who&f.who.qf=name&f.name.qf=who

          This is probably due to an error and the solution is probably throw an exception, but right now this is not considered and will lead to a stack overflow error.

          Show
          Tomás Fernández Löbbe added a comment - Sorry about that. This is the patch with the indentation changes. I also removed some lines that are now useless with the changes introduced to the parser. With infinite loops I mean cyclic aliasing. Something like: ...&qf=who&f.who.qf=name&f.name.qf=who This is probably due to an error and the solution is probably throw an exception, but right now this is not considered and will lead to a stack overflow error.
          Hide
          Tomás Fernández Löbbe added a comment -

          Added cyclic aliasing validation

          Show
          Tomás Fernández Löbbe added a comment - Added cyclic aliasing validation
          Hide
          Jan Høydahl added a comment - - edited

          Super duper. I tested it and it works great! Strange, I could not get aliasing for qf fields to work before now. Now it works like a charm.

          A possible optimization would be to detect if the f.who.qf= contains just a single field, and create a simple TermQuery instead of a DisMaxQuery in that case. But it might not be important for performance..

          Another thing to re-consider is whether the default should be uf=* or uf=-*. If we aim to let edismax replace dismax, people may want it to behave like dismax out of the box. But if it won't replace dismax it's better to stick with the current defaults which people already are used to. Note that since eDismax is still @lucene.experimental we should not be afraid to change defaults.

          Show
          Jan Høydahl added a comment - - edited Super duper. I tested it and it works great! Strange, I could not get aliasing for qf fields to work before now. Now it works like a charm. A possible optimization would be to detect if the f.who.qf= contains just a single field, and create a simple TermQuery instead of a DisMaxQuery in that case. But it might not be important for performance.. Another thing to re-consider is whether the default should be uf=* or uf=-* . If we aim to let edismax replace dismax, people may want it to behave like dismax out of the box. But if it won't replace dismax it's better to stick with the current defaults which people already are used to. Note that since eDismax is still @lucene.experimental we should not be afraid to change defaults.
          Hide
          Tomás Fernández Löbbe added a comment - - edited

          I think the default should be uf=*, otherwise it will be confusing. I think "field" search together with "dismax" search will be one of the main reasons why people will move from other QP to edismax, and with uf=-* they will not get that behavior until they explicitly change it. I bet that if we use uf=-* we'll get many questions related to this in the mailing list.

          About the optimization, I think its a good idea, however it should be a different Jira. The optimization could be applied to f.who.qf as well as to qf= right?

          Show
          Tomás Fernández Löbbe added a comment - - edited I think the default should be uf=*, otherwise it will be confusing. I think "field" search together with "dismax" search will be one of the main reasons why people will move from other QP to edismax, and with uf=-* they will not get that behavior until they explicitly change it. I bet that if we use uf=-* we'll get many questions related to this in the mailing list. About the optimization, I think its a good idea, however it should be a different Jira. The optimization could be applied to f.who.qf as well as to qf= right?
          Hide
          Hoss Man added a comment -

          If we aim to let edismax replace dismax, people may want it to behave like dismax out of the box

          I don't think that should be the goal. plenty of people are using "edismax" already because they like the fact that it is a super set of the dismax & lucene features, and the defaults for "edismax" should embrace that.

          if/when EDisMaxQParser reaches the point that it can be configured to work exactly the same as DisMaxQParser, then it may be worth considering defaulting "dismax" => an EDisMaxQParser instance configured that way, but that doesn't mean "edismax" shouldn't expose all of it's bells and whistles by default.

          uf=* as a default should be fine – the only reason to question it would be if it was hard to disable, but the "-*" syntax is so easy it's not worth worrying about it.

          Show
          Hoss Man added a comment - If we aim to let edismax replace dismax, people may want it to behave like dismax out of the box I don't think that should be the goal. plenty of people are using "edismax" already because they like the fact that it is a super set of the dismax & lucene features, and the defaults for "edismax" should embrace that. if/when EDisMaxQParser reaches the point that it can be configured to work exactly the same as DisMaxQParser, then it may be worth considering defaulting "dismax" => an EDisMaxQParser instance configured that way, but that doesn't mean "edismax" shouldn't expose all of it's bells and whistles by default. uf=* as a default should be fine – the only reason to question it would be if it was hard to disable, but the "-*" syntax is so easy it's not worth worrying about it.
          Hide
          Tomás Fernández Löbbe added a comment -

          Is there anything else to do in order to get this committed?

          Show
          Tomás Fernández Löbbe added a comment - Is there anything else to do in order to get this committed?
          Hide
          Jan Høydahl added a comment -

          Just some minor polishings. Will commit this to trunk now.

          Show
          Jan Høydahl added a comment - Just some minor polishings. Will commit this to trunk now.
          Hide
          Jan Høydahl added a comment -

          Committed to trunk. Starting to backport

          Show
          Jan Høydahl added a comment - Committed to trunk. Starting to backport
          Hide
          Jan Høydahl added a comment -

          Attaching patch for branch_3x.

          Show
          Jan Høydahl added a comment - Attaching patch for branch_3x.
          Hide
          Jan Høydahl added a comment -

          Committed in branch_3x

          Show
          Jan Høydahl added a comment - Committed in branch_3x
          Hide
          Jan Høydahl added a comment -

          Created documentation page for eDisMax: http://wiki.apache.org/solr/ExtendedDisMax

          Show
          Jan Høydahl added a comment - Created documentation page for eDisMax: http://wiki.apache.org/solr/ExtendedDisMax

            People

            • Assignee:
              Jan Høydahl
              Reporter:
              Jan Høydahl
            • Votes:
              4 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development