Lucene - Core / LUCENE-5336

Add a simple QueryParser to parse human-entered queries.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      I would like to add a new simple QueryParser to Lucene that is designed to parse human-entered queries. This parser will operate on an entire entered query using a specified single field or a set of weighted fields (using term boost).

      All features/operations in this parser can be enabled or disabled depending on what is necessary for the user. A default operator may be specified as either 'MUST' representing 'and' or 'SHOULD' representing 'or.' The features/operations that this parser will include are the following:

      • AND specified as '+'
      • OR specified as '|'
      • NOT specified as '-'
      • PHRASE surrounded by double quotes
      • PREFIX specified as '*'
      • PRECEDENCE surrounded by '(' and ')'
      • WHITESPACE specified as ' ' '\n' '\r' and '\t' will cause the default operator to be used
      • ESCAPE specified as '\' will allow operators to be used in terms
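
To make the "enabled or disabled" idea concrete, here is a minimal sketch of how a per-feature switch could decide whether a character acts as an operator or as ordinary term text. The names `Feature` and `OperatorTable` are hypothetical and are not taken from the attached patch; only the character-to-operation mapping comes from the list above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical names for illustration only; not the code from the patch.
enum Feature { AND, OR, NOT, PHRASE, PREFIX, PRECEDENCE, WHITESPACE, ESCAPE }

final class OperatorTable {
    private final Set<Feature> enabled;

    OperatorTable(Set<Feature> enabled) {
        // Copy defensively; noneOf + addAll also works for an empty input set.
        this.enabled = EnumSet.noneOf(Feature.class);
        this.enabled.addAll(enabled);
    }

    // Returns true if c should be treated as an operator. When the matching
    // feature is disabled, the character is just part of a term.
    boolean isOperator(char c) {
        switch (c) {
            case '+':  return enabled.contains(Feature.AND);
            case '|':  return enabled.contains(Feature.OR);
            case '-':  return enabled.contains(Feature.NOT);
            case '"':  return enabled.contains(Feature.PHRASE);
            case '*':  return enabled.contains(Feature.PREFIX);
            case '(':
            case ')':  return enabled.contains(Feature.PRECEDENCE);
            case ' ':
            case '\n':
            case '\r':
            case '\t': return enabled.contains(Feature.WHITESPACE);
            case '\\': return enabled.contains(Feature.ESCAPE);
            default:   return false;
        }
    }

    public static void main(String[] args) {
        // Everything enabled except NOT, so '-' is plain text here.
        OperatorTable table =
            new OperatorTable(EnumSet.complementOf(EnumSet.of(Feature.NOT)));
        System.out.println(table.isOperator('-')); // false
        System.out.println(table.isOperator('|')); // true
    }
}
```

With NOT disabled in this sketch, a query such as `wi-fi` would keep its hyphen as part of the term instead of being read as a negation.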

      The key differences between this parser and other existing parsers will be the following:

      • No exceptions will be thrown, and errors in syntax will be ignored. The parser will do a best-effort interpretation of any query entered.
      • It uses minimal syntax to express queries. All available operators are single characters or pairs of single characters.
      • The parser is hand-written and in a single Java file making it easy to modify.
      1. LUCENE-5336.patch
        46 kB
        Robert Muir
      2. LUCENE-5336.patch
        45 kB
        Jack Conradson
      3. LUCENE-5336.patch
        45 kB
        Jack Conradson

        Activity

        Jack Conradson added a comment -

        I have attached a patch for this JIRA.

        Michael McCandless added a comment -

        This is AWESOME. I love how the operators (even whitespace!) are
        optional. And I love the name. And it's great that it NEVER throws
        an exception no matter how awful the input is. And I love that it does not
        use a lexer/parser generator: this makes it much more approachable
        to those devs that don't have experience with parser generators.

        Small javadoc fix: instead of "any {@code -} characters beyond the
        first character in a term may not need to be escaped," I think it
        should say "any {@code -} characters beyond the first character do not
        need to be escaped" (and same for the * operator)?

        How does it handle malformed input, e.g. a missing closing " for a
        phrase query? If I enter "foo bar will it just make a term query for
        "foo and a term query for bar? Or, does it strip that " and do query
        foo instead? (Same for missing closing paren?). It looks like it
        drops the " and ( and does a simple term query (good).

        Maybe you could add fangs to the random test by more frequently mixing
        in these operator characters ...

        Paul Elschot added a comment -

        A realistic query parser is not likely to be any simpler than this, so why not call it "simple"?

        Jack Conradson added a comment -

        Thanks for the feedback.

        To answer the malformed input question:

        If "foo bar is given as the query, the double quote will be dropped. If whitespace is an operator, it will make term queries for both 'foo' and 'bar'; otherwise it will make a single term query 'foo bar'.

        If foo"bar is given as the query, the double quote will be dropped, and term queries will be made for both 'foo' and 'bar'.

        It's done this way because the parser only backtracks as far as the malformed input (in this case the extraneous double quote), so 'foo' would already be part of the query tree: only a single pass is made over each query. The parser could be changed to do two passes to remove extraneous characters, but I believe that would only make the code more complex without necessarily interpreting the query any better for the user, since the malformed character gives no hint as to what they really intended to do.
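
The behaviour described above can be illustrated with a small standalone sketch. It is not the code from the patch: the class `LenientQuoteSketch`, its `parse` method and the `term:`/`phrase:` output strings are all made up for illustration, and a forward search for the closing quote stands in for the parser's backtracking. What it tries to show is that an unmatched double quote is dropped while anything already emitted before it stays in the result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the real parser builds Lucene Query objects.
final class LenientQuoteSketch {

    static List<String> parse(String query, boolean whitespaceIsOperator) {
        List<String> out = new ArrayList<>();
        StringBuilder term = new StringBuilder();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (c == '"') {
                flush(term, out); // text before the quote is already decided
                int close = query.indexOf('"', i + 1);
                if (close >= 0) {
                    String phrase = query.substring(i + 1, close).trim();
                    if (!phrase.isEmpty()) {
                        out.add("phrase:" + phrase);
                    }
                    i = close + 1;
                } else {
                    i++; // no closing quote: drop it and resume right after it
                }
            } else if (whitespaceIsOperator && Character.isWhitespace(c)) {
                flush(term, out);
                i++;
            } else {
                term.append(c);
                i++;
            }
        }
        flush(term, out);
        return out;
    }

    private static void flush(StringBuilder term, List<String> out) {
        if (term.length() > 0) {
            out.add("term:" + term);
            term.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("\"foo bar", true));   // [term:foo, term:bar]
        System.out.println(parse("\"foo bar", false));  // [term:foo bar]
        System.out.println(parse("foo\"bar", true));    // [term:foo, term:bar]
        System.out.println(parse("\"foo bar\"", true)); // [phrase:foo bar]
    }
}
```

No input makes this sketch throw, which mirrors the best-effort goal stated in the description.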

        I will try to post another patch today or tomorrow.

        I plan to do the following:

        • Fix the Javadoc comment
        • Add more tests for random operators
        • Rename the class to SimpleQueryParser and rename the package to .simple
        Jack Conradson added a comment -

        Attached an updated version of the patch with the three modifications from my previous comment.

        Adrien Grand added a comment -

        Javadocs and code seem to disagree on the default operator: the javadocs say "The default operator is AND if no other operator is specified." while the code has "private BooleanClause.Occur defaultOperator = BooleanClause.Occur.SHOULD;"?

        Otherwise I agree with Mike that this new query parser is awesome. I will certainly use it!

        Robert Muir added a comment -

        I took a swipe at trying to make the javadocs easier to read (just different layout).

        Also folded in Adrien's fix.

        Michael McCandless added a comment -

        +1, javadocs and the new test look great!

        Adrien Grand added a comment -

        +1

        ASF subversion and git services added a comment -

        Commit 1541151 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1541151 ]

        LUCENE-5336: add SimpleQueryParser for human-entered queries

        ASF subversion and git services added a comment -

        Commit 1541158 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1541158 ]

        LUCENE-5336: add SimpleQueryParser for human-entered queries

        Robert Muir added a comment -

        Thanks Jack!

        ASF subversion and git services added a comment -

        Commit 1557073 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1557073 ]

        LUCENE-5336, LUCENE-5376: expose SimpleQueryParser in lucene server

        Marcio Napoli added a comment -

        I believe it would be interesting to include support for prefix/suffix (term* or *term) and also for a date range such as [20120910 TO 20130101]. Thanks!


          People

          • Assignee: Unassigned
          • Reporter: Jack Conradson
          • Votes: 3
          • Watchers: 11
