  1. Lucene - Core
  2. LUCENE-9315

redefine (Classic & Standard) QueryParser semantics to be consistent: prioritize prefix op > infix op > default op


    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels: None
    • Lucene Fields: New


      For as long as I can remember, the way QueryParser deals with the "infix" operators AND & OR hasn't made much sense unless they are used consistently to express pure boolean logic (ie: always explicitly specified, and never more than 2 clauses to a query). The behavior becomes impossible for new users to make sense of, and hard to explain or justify, as soon as you have query strings where a BooleanQuery has more than 2 clauses, or query strings that mix AND & OR with the "prefix" operators + and - (or NOT), or query strings where not every clause has an operator, or (absolutely the most confusing) query strings that mix the types of operators while the QueryParser "default op" is changed from OR to AND. (It's not precedence based, it's not left to right, it's just ... weird.)

      The problem is so confusing to new users that I wrote a blog post almost 10 years ago (?!?) trying to convince people that using AND & OR was a terrible idea unless they were used only in strict boolean expressions...


      ...and yet it still regularly comes up as a point of confusion.

      A lot of this weird behavior seems to be a historical artifact of how QueryParserBase.addClauses() works - a method whose basic semantics haven't really changed since Lucene 1.0.1, back before the introduction of QueryParser.setDefaultOperator(). Some of those early choices seemed to be predicated on the idea that AND should take "precedence" (I use that term loosely) over OR as it parses clauses left to right, purely because OR was the "default" assumption (and had - and still has - no corresponding "prefix" operator). As functionality in QueryParser has grown, a lot of the assumptions made in the code, and the resulting parse behavior, really make no sense to users, particularly in "non trivial" query strings. In many cases, parse behavior that can seem "intentional" to new users, even for input where every clause is impacted by an explicit AND or OR operator, can suddenly be flipped on its head when the "default operator" is changed (ex: "X AND Y OR Z"), or if only the order of "clauses" in the string changes (ex: previous example vs "Z OR Y AND X"), even though it's clear from other queries that there is no strict precedence of operators.

      The "root" of the problem, as I see it, is that QueryParserBase.addClauses() allows AND & OR to modify the Occur property of the previously parsed BooleanClause depending on whether that BooleanClause.getOccur() value matches the "default operator" for the parser, w/o any consideration of why that getOccur() value matches the "default operator" - ie: did it actually come from the "default", or was it explicitly set by something in the query string (ie: a prior infix operator)?

      I propose that starting with Lucene 9.0, we redefine the semantics in QueryParserBase such that:

      • "Prefix" operators (+ | - | NOT) always take precedence (over any "Infix" operator or QueryParser default) in setting the Occur value of the clause they prefix.
      • "Infix" operators (AND | OR) are evaluated left to right and used to set the Occur value of the clauses adjacent to them (that do not already have an Occur value set by a "Prefix" operator).
      • The QueryParser.getDefaultOperator() is only used to set the Occur value of any clause that did not get an Occur value assigned by either a prefix or (prior) infix operator.
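
      To make the three rules above concrete, here is a minimal standalone Java sketch. It does not use Lucene and is not the attached patch: the class ProposedOccurSketch, its parse() method, and the local Occur enum are hypothetical names for illustration, and the default operator is passed in directly as an Occur (OR as SHOULD, AND as MUST). It also assumes one reading of "evaluated left to right": the first infix operator to touch a clause wins, and a later infix operator does not override it. Only whitespace-separated single terms are handled (no parentheses, phrases, or fields).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed rules; not Lucene code.
class ProposedOccurSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    /** A parsed clause; occur stays null until some rule assigns it. */
    static final class Clause {
        final String term;
        Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    /** Returns the clauses rendered as "+must should -mustNot". */
    static String parse(String query, Occur defaultOccur) {
        List<Clause> clauses = new ArrayList<>();
        Occur pendingPrefix = null; // set by a standalone NOT
        Occur pendingInfix = null;  // set by the AND/OR just before the next clause

        for (String tok : query.trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (tok.equals("AND") || tok.equals("OR")) {
                Occur o = tok.equals("AND") ? Occur.MUST : Occur.SHOULD;
                // Rule 2: an infix op sets the clause on its left only if
                // nothing (prefix or earlier infix) has set it already.
                if (!clauses.isEmpty()) {
                    Clause prev = clauses.get(clauses.size() - 1);
                    if (prev.occur == null) prev.occur = o;
                }
                pendingInfix = o; // ...and offers the same value to the clause on its right
            } else if (tok.equals("NOT")) {
                pendingPrefix = Occur.MUST_NOT;
            } else {
                String term = tok;
                Occur prefix = pendingPrefix;
                if (tok.length() > 1 && tok.startsWith("+")) {
                    prefix = Occur.MUST;
                    term = tok.substring(1);
                } else if (tok.length() > 1 && tok.startsWith("-")) {
                    prefix = Occur.MUST_NOT;
                    term = tok.substring(1);
                }
                // Rule 1: a prefix op always beats the pending infix op.
                clauses.add(new Clause(term, prefix != null ? prefix : pendingInfix));
                pendingPrefix = null;
                pendingInfix = null;
            }
        }

        StringBuilder sb = new StringBuilder();
        for (Clause c : clauses) {
            // Rule 3: the default applies only to clauses nothing else touched.
            Occur o = (c.occur != null) ? c.occur : defaultOccur;
            if (sb.length() > 0) sb.append(' ');
            if (o == Occur.MUST) sb.append('+');
            else if (o == Occur.MUST_NOT) sb.append('-');
            sb.append(c.term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(parse("X AND Y OR Z", Occur.SHOULD));
        System.out.println(parse("X AND Y OR Z", Occur.MUST));
        System.out.println(parse("X -Y Z", Occur.MUST));
    }
}
```

      Under these assumptions "X AND Y OR Z" comes out as "+X +Y Z" whether the default is SHOULD or MUST, because every clause is already claimed by an infix operator. The default only shows up in a string like "X -Y Z", which yields "+X -Y +Z" with a MUST default and "X -Y Z" with a SHOULD default.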


        1. LUCENE-9315.patch
          15 kB
          Chris M. Hostetter



            Assignee: Unassigned
            Reporter: Chris M. Hostetter (hossman)