Solr
  1. Solr
  2. SOLR-3580

In ExtendedDismax, lowercase 'not' operator is not being treated as an operator when 'lowercaseOperators' is enabled

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: query parsers
    • Labels:
      None

      Description

      When lowercase operator support is enabled (for edismax), the lowercase 'not' operator is being wrongly treated as a literal term (and not as an operator).

      1. SOLR-3580-proposal.patch
        34 kB
        Michael Dodsworth
      2. SOLR-3580.patch
        8 kB
        Michael Dodsworth

        Issue Links

          Activity

          Hide
          Eric Pugh added a comment -

          I was about to submit a patch for the fact that 'NOT' and 'not' don't work the same, when I stumbled across this issue. My patch file looks rather remarkably like Michael Dodsworth first patch as well!

          One thing is that the wiki needs an update: http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators I can put that in, referring to the patch files as option if you need not:NOT support.

          I would like to see something committed, as my customer has the same need for NOT to work. Their users are sophisticated, know the syntax etc. Backup plan is to do something custom.

          Show
          Eric Pugh added a comment - I was about to submit a patch for the fact that 'NOT' and 'not' don't work the same, when I stumbled across this issue. My patch file looks rather remarkably like Michael Dodsworth first patch as well! One thing is that the wiki needs an update: http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators I can put that in, referring to the patch files as option if you need not:NOT support. I would like to see something committed, as my customer has the same need for NOT to work. Their users are sophisticated, know the syntax etc. Backup plan is to do something custom.
          Hide
          Jan Høydahl added a comment -

          Any progress on this?

          A more compact config format would be possible too:

          validOperators=or:OR,or and:AND,and not:NOT
          

          Then, if we add more operators later, such as NEAR, the same param would apply. Perhaps we also could use this to disable explicit operators alltogether by allowing an empty list validOperators=or: and: not: - for people who want to lock things down.

          As for interaction with lowercaseOperators, I'm happy to let it hang around for 4.x, but throw an exception if both params are defined together.

          Show
          Jan Høydahl added a comment - Any progress on this? A more compact config format would be possible too: validOperators=or:OR,or and:AND,and not:NOT Then, if we add more operators later, such as NEAR, the same param would apply. Perhaps we also could use this to disable explicit operators alltogether by allowing an empty list validOperators=or: and: not: - for people who want to lock things down. As for interaction with lowercaseOperators, I'm happy to let it hang around for 4.x, but throw an exception if both params are defined together.
          Hide
          Hoss Man added a comment -

          removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (this can certainly be revisited if volunteers step forward)

          Show
          Hoss Man added a comment - removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (this can certainly be revisited if volunteers step forward)
          Hide
          Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          Show
          Robert Muir added a comment - rmuir20120906-bulk-40-change
          Hide
          Michael Dodsworth added a comment -

          Thanks for the feedback, Jack/Yonik.

          1 - support for mixed-case operators is as before: they are interpreted as operators. Having said that, there appears to be an subtle bug with the 'mm' toggling behaviour. The operator counting (used to determine whether 'mm' needs to be disabled) only accepts strict uppercase and lowercase, whereas the query rebuild accepts mixed-case. I can also fix that up and add a test.

          2 - the 'supportedLowercaseOperators' parameter would be in addition to 'lowercaseOperators', rather than replacing it. If 'lowercaseOperators' is true, we look for a 'supportedLowercaseOperators' value. If no value is provided, we use the default (and, or), which means we have backwards compatibility.

          Yonik - yeah, Jan's proposal is absolutely the most flexible. I guess my concerns were:

          • that it might snowball into wanting to have an external, stopword-esk file for per-language operator support (minor concern)
          • that we'd lose some backwards compatibility, as currently mixed-case operators are supported (although the default set could be expanded to accommodate this, if needed)
          • the interaction between the 'lowercaseOperators' parameter and 'valid*' might get a little funky. For example, if we simply ignore 'lowercaseOperators' when a 'valid*' parameter is present, there is no potential for confusion BUT toggling lowercase operator support per query then becomes a head-ache (as the upstream client needs to pass through the supported uppercase operators). If we allow interaction between 'lowercaseOperators' and 'valid*', which parameter takes priority? To allow toggling per-query, lowercaseOperators should take priority. Perhaps a good dollop of documentation would be enough here

          Let me extend the patch to switch-over to Jan's proposal so people can take a look.
          Cheers,

          Show
          Michael Dodsworth added a comment - Thanks for the feedback, Jack/Yonik. 1 - support for mixed-case operators is as before: they are interpreted as operators. Having said that, there appears to be an subtle bug with the 'mm' toggling behaviour. The operator counting (used to determine whether 'mm' needs to be disabled) only accepts strict uppercase and lowercase, whereas the query rebuild accepts mixed-case. I can also fix that up and add a test. 2 - the 'supportedLowercaseOperators' parameter would be in addition to 'lowercaseOperators', rather than replacing it. If 'lowercaseOperators' is true, we look for a 'supportedLowercaseOperators' value. If no value is provided, we use the default (and, or), which means we have backwards compatibility. Yonik - yeah, Jan's proposal is absolutely the most flexible. I guess my concerns were: that it might snowball into wanting to have an external, stopword-esk file for per-language operator support (minor concern) that we'd lose some backwards compatibility, as currently mixed-case operators are supported (although the default set could be expanded to accommodate this, if needed) the interaction between the 'lowercaseOperators' parameter and 'valid*' might get a little funky. For example, if we simply ignore 'lowercaseOperators' when a 'valid*' parameter is present, there is no potential for confusion BUT toggling lowercase operator support per query then becomes a head-ache (as the upstream client needs to pass through the supported uppercase operators). If we allow interaction between 'lowercaseOperators' and 'valid*', which parameter takes priority? To allow toggling per-query, lowercaseOperators should take priority. Perhaps a good dollop of documentation would be enough here Let me extend the patch to switch-over to Jan's proposal so people can take a look. Cheers,
          Hide
          Yonik Seeley added a comment -

          If we are to do anything here, I think I like Jan's proposal best.

          Show
          Yonik Seeley added a comment - If we are to do anything here, I think I like Jan's proposal best.
          Hide
          Jack Krupansky added a comment -

          I'm okay with the proposal for "supportedLowercaseOperators", but I would make two points:

          1. What about mixed case? Would mixed case And/Or/Not be treated as lower case? Or only strict lower case?
          2. How about a revised compromise - Keep "lowercaseOperators=true/false" as is for compatibility, but also support the same parameter name with the form "lowercaseOperators=and,or,not" to provide the desired flexibility. All the benefit of the current proposal, but without the need to switch/deprecate the old form.

          Show
          Jack Krupansky added a comment - I'm okay with the proposal for "supportedLowercaseOperators", but I would make two points: 1. What about mixed case? Would mixed case And/Or/Not be treated as lower case? Or only strict lower case? 2. How about a revised compromise - Keep "lowercaseOperators=true/false" as is for compatibility, but also support the same parameter name with the form "lowercaseOperators=and,or,not" to provide the desired flexibility. All the benefit of the current proposal, but without the need to switch/deprecate the old form.
          Hide
          Michael Dodsworth added a comment -

          Does this seem like a reasonable direction to everyone?

          Show
          Michael Dodsworth added a comment - Does this seem like a reasonable direction to everyone?
          Hide
          Hoss Man added a comment -

          bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment

          Show
          Hoss Man added a comment - bulk fixing the version info for 4.0-ALPHA and 4.0 all affected issues have "hoss20120711-bulk-40-change" in comment
          Hide
          Michael Dodsworth added a comment -

          adding the 'supportedLowercaseOperator' parameter I mentioned.

          Also cleared up a few unused vars, assignments, etc.

          Requires more tests but I'm interested to hear what people think.

          Show
          Michael Dodsworth added a comment - adding the 'supportedLowercaseOperator' parameter I mentioned. Also cleared up a few unused vars, assignments, etc. Requires more tests but I'm interested to hear what people think.
          Hide
          Michael Dodsworth added a comment -

          one option (that sits somewhere between the 2 proposed solutions) may be to have a 'supportedLowercaseOperators' setting that takes a comma-separated list of supported operators. If no override is provided, the default behaviour would be to accept '[and,or]'.

           
          <str name="supportedLowercaseOperators">and,or,not</str>
          

          Let me get a patch together so people can take a look.

          Show
          Michael Dodsworth added a comment - one option (that sits somewhere between the 2 proposed solutions) may be to have a 'supportedLowercaseOperators' setting that takes a comma-separated list of supported operators. If no override is provided, the default behaviour would be to accept ' [and,or] '. <str name= "supportedLowercaseOperators" > and,or,not </str> Let me get a patch together so people can take a look.
          Hide
          Jan Høydahl added a comment -

          Or redo the whole thing in a more generic way:

          validOrOperators=OR,or     (default=OR)
          validAndOperators=AND,and  (default=AND)
          validNotOperators=NOT      (default=NOT)
          

          This way people cannot only add lowercase variants if wanted, but they can also translate to their own language (Norwegian: ELLER / OG / IKKE) and take more control alltogether; The old lowercaseOperators would be deprecated and removed in next version.

          Show
          Jan Høydahl added a comment - Or redo the whole thing in a more generic way: validOrOperators=OR,or (default=OR) validAndOperators=AND,and (default=AND) validNotOperators=NOT (default=NOT) This way people cannot only add lowercase variants if wanted, but they can also translate to their own language (Norwegian: ELLER / OG / IKKE) and take more control alltogether; The old lowercaseOperators would be deprecated and removed in next version.
          Hide
          Jack Krupansky added a comment -

          My recommendation is to have an additional option, "lowercaseNotOperator" which defaults to "false". This would be the safe choice that Yonik recommends, but allow you to override that decision as you see fit for your application.

          Show
          Jack Krupansky added a comment - My recommendation is to have an additional option, "lowercaseNotOperator" which defaults to "false". This would be the safe choice that Yonik recommends, but allow you to override that decision as you see fit for your application.
          Hide
          Michael Dodsworth added a comment -

          were we not allowing the user to explicitly specify that they want to support lowercase operators, I might agree.

          That setting should (at the very least) come with a clear health warning so that more people aren't caught out by this.

          Show
          Michael Dodsworth added a comment - were we not allowing the user to explicitly specify that they want to support lowercase operators, I might agree. That setting should (at the very least) come with a clear health warning so that more people aren't caught out by this.
          Hide
          Yonik Seeley added a comment -

          edismax is about heuristics and sometimes guessing user intent... if exact/strict syntax is desired, the lucene query parser is a better fit.

          Show
          Yonik Seeley added a comment - edismax is about heuristics and sometimes guessing user intent... if exact/strict syntax is desired, the lucene query parser is a better fit.
          Hide
          Michael Dodsworth added a comment -

          surely that's a more general hazard with supporting lowercase operators. It seems strange to give 'not' special treatment. There are likely are examples where having 'and' or 'or' wrongly treated as a operator /is/ catastrophic, therefore the onus should be on the client to choose the correct 'lowercaseOperator' option for their use-case.

          Show
          Michael Dodsworth added a comment - surely that's a more general hazard with supporting lowercase operators. It seems strange to give 'not' special treatment. There are likely are examples where having 'and' or 'or' wrongly treated as a operator /is/ catastrophic, therefore the onus should be on the client to choose the correct 'lowercaseOperator' option for their use-case.
          Hide
          Yonik Seeley added a comment -

          This is by design. Treating "and" and "or" as operators when people may not realize they are is much less catastrophic than treating not as an operator.

          If someone searches for
          to be or not to be
          excluding all documents with "to" in them is very bad.

          Show
          Yonik Seeley added a comment - This is by design. Treating "and" and "or" as operators when people may not realize they are is much less catastrophic than treating not as an operator. If someone searches for to be or not to be excluding all documents with "to" in them is very bad.
          Hide
          Michael Dodsworth added a comment -

          patch includes:

          • fix to edismax
          • replacement operator test (covers upper and lowercase 'AND', 'OR' and 'NOT' operators with 'lowercaseOperators' enabled/disabled)
          • small clear-up in the test (adding @Test annotations and removing redundant 'throws IOException's)
          Show
          Michael Dodsworth added a comment - patch includes: fix to edismax replacement operator test (covers upper and lowercase 'AND', 'OR' and 'NOT' operators with 'lowercaseOperators' enabled/disabled) small clear-up in the test (adding @Test annotations and removing redundant 'throws IOException's)

            People

            • Assignee:
              Unassigned
              Reporter:
              Michael Dodsworth
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Development