I've been following this for at least two years. See my comment from February 2012, above. I can't tell if the proposed fix is a fix. The goal ought to be that the system behaves deterministically, in a way that can be explained to users, and that it acts contrary to user expectations as little as possible (and above all never silently). The failure to abide by this principle is what made this issue so troubling to me: users could know that whitespace would be interpreted as "AND", yet they would still get results that discarded the effect that operator should have had.
Now, of course, users make mistakes. They submit ambiguous queries (or in the case of mm=100% for a disjunctive query, I guess we could call that a self-defeating or self-contradictory query--if I understand mm correctly).
I still think that two things are really needed: (1) a set of default rules for interpreting ambiguous queries that always yields a deterministic result and that can be explained to users, and (2) an error message whenever a user does something that doesn't make sense under those default rules.
The ambiguous query discussed above was one where whitespace was set to "AND" and a user entered:
(A or B or C) "D E"
Such a user must be assuming that whitespace within quotation marks is ignored, i.e., that the quotation marks make "D E" a single term that must be matched exactly, and that, given the default to conjunction for non-quoted whitespace, her query will be parsed as:
(A or B or C) AND "D E"
that is, as a conjunction with two conjuncts, thus requiring that each conjunct be satisfied to get a matching result.
My first question then is, what will happen to this query under the new patch? Will it be interpreted as expected?
My second question is, why not adopt a set of default rules for ambiguous queries? The default order of operations in arithmetic is a convention that tells us to read 3 + 5 x 4 as 3 + (5 x 4); we simply need a similar convention here. Just as it doesn't matter in arithmetic which operators we favor, so long as everyone knows the convention, it also doesn't really matter what rules we adopt here, so long as we publicize them so users and maintainers know what to expect. I would propose the following:
1. Whitespace within quotation marks is ignored (in that it is not turned into an operator); that is, "D E" is interpreted as a single term that must match exactly.
2. If a query lacks sufficient parentheses to create an unambiguous query, then the following rules will be applied:
a. Insert parentheses around every occurrence of AND and its two conjuncts, starting with the rightmost AND.
b. Insert parentheses in the same fashion for OR.
c. Right parentheses are never inserted within another set of parentheses, i.e., no existing pairings are broken up.
3. If one's query is nonsensical, an error message will be displayed explaining the problem. For example, if one has set mm to 100%, requiring every term to match, yet also issues a disjunctive query (A OR B) that would be satisfied if either term were to match, then one receives an error indicating that mm cannot be set to 100% while issuing a disjunctive query.
I think those rules would be sufficient to resolve all ambiguous queries, and the general idea that "If you leave out parentheses, then they'll be added to the smallest available units starting from the right, and starting with conjunction" is one that users could (somewhat) easily grasp.
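To make the proposed rules concrete, here is a minimal Python sketch of how they might be applied. It is only an illustration of the convention, not a description of how Solr's parser works. The names (parse, tokenize, QueryError, mm_percent) are hypothetical, and it assumes that whitespace defaults to AND, that operators are matched case-insensitively (so "or" counts as OR), and that "starting with the rightmost" means right-to-left grouping with AND binding tighter than OR.

```python
import re

# Hypothetical token pattern: a quoted phrase (rule 1: one term),
# a parenthesis, or a bare word.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


class QueryError(ValueError):
    """Raised instead of silently guessing at a nonsensical query (rule 3)."""


def tokenize(query):
    if query.count('"') % 2:
        raise QueryError("unbalanced quotation marks")
    return TOKEN.findall(query)


def contains_or(node):
    # Terms are plain strings; operators are ("AND"/"OR", left, right) tuples.
    if isinstance(node, tuple):
        return node[0] == "OR" or contains_or(node[1]) or contains_or(node[2])
    return False


def parse(query, mm_percent=None):
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def is_op(tok, name):
        return tok is not None and tok.upper() == name

    def parse_or():
        # Rule 2b: OR is grouped after AND, from the right.
        nonlocal pos
        left = parse_and()
        if is_op(peek(), "OR"):
            pos += 1
            return ("OR", left, parse_or())
        return left

    def parse_and():
        # Rule 2a: AND is grouped first, from the right.
        nonlocal pos
        left = parse_primary()
        nxt = peek()
        if nxt is None or nxt == ")" or is_op(nxt, "OR"):
            return left
        if is_op(nxt, "AND"):
            pos += 1  # explicit AND; otherwise bare whitespace means AND
        return ("AND", left, parse_and())

    def parse_primary():
        # Rule 2c: existing parentheses are never broken up.
        nonlocal pos
        tok = peek()
        if tok is None or tok == ")" or is_op(tok, "AND") or is_op(tok, "OR"):
            raise QueryError("expected a term or '('")
        pos += 1
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise QueryError("missing closing parenthesis")
            pos += 1
            return node
        return tok

    tree = parse_or()
    if pos != len(tokens):
        raise QueryError("unexpected token: " + tokens[pos])
    if mm_percent == 100 and contains_or(tree):
        raise QueryError("mm cannot be 100% for a disjunctive query")
    return tree


print(parse('(A or B or C) "D E"'))
# ('AND', ('OR', 'A', ('OR', 'B', 'C')), '"D E"')
```

Under these assumptions the query discussed above comes out as a conjunction of the parenthesized disjunction and the quoted phrase, A OR B AND C is read as A OR (B AND C), and parse("A OR B", mm_percent=100) raises an error rather than returning results the user did not ask for.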
But, as I said in 2012, my grasp on how Solr handles mm is tenuous at best, so perhaps someone will explain that I'm misunderstanding something important.