[LUCENE-1567] New flexible query parser - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.9
Component/s: core/queryparser
Labels: None
Environment: N/A
Lucene Fields: New

Description

From the "New flexible query parser" thread by Michael Busch:

In my team at IBM we have used a query parser other than Lucene's in
our products for quite a while. Recently we spent a significant amount
of time refactoring the code and designing a very generic
architecture, so that this query parser can easily be used for different
products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left
our team); most of the code was written by Luis Alves, who has been a
bit active in Lucene in the past, and Adriano Campos, who joined our
team at IBM half a year ago. Adriano is an Apache committer and PMC member
on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current
Lucene query parser, and would therefore like to contribute it to
Lucene. I'd like to give a very brief architecture overview here;
Adriano and Luis can then answer more detailed questions, as they're much
more familiar with the code than I am.

The goal was to separate the syntax and semantics of a query. E.g. 'a AND
b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
We distinguish the semantics of the different query components, e.g.
whether and how to tokenize/lemmatize/normalize the different terms or
which Query objects to create for the terms. We wanted to be able to
write a parser with a new syntax, while reusing the underlying
semantics, as quickly as possible.

In fact, Adriano is currently working on a 100% Lucene-syntax compatible
implementation to make it easy for people who are using Lucene's query
parser to switch.
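
As a rough illustration of that separation, the Java sketch below parses the three example syntaxes into one syntax-neutral tree. All names here (Node, SyntaxSketch, parseInfix, parsePlus, parseFunction) are hypothetical and are not taken from the contributed code; the parsers handle only a flat conjunction of terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical syntax-neutral node: an operator with children, or a leaf term. */
final class Node {
    final String label;          // "AND" for the operator, or the term text for a leaf
    final List<Node> children;   // empty for a leaf

    Node(String label, Node... children) {
        this.label = label;
        this.children = Arrays.asList(children);
    }

    @Override
    public String toString() {
        return children.isEmpty() ? label : label + children;
    }
}

/** Three toy front ends (assumed, not the real parsers) that share one tree shape. */
final class SyntaxSketch {
    /** Infix syntax: "a AND b". */
    static Node parseInfix(String text) {
        return and(text.trim().split("\\s+AND\\s+"));
    }

    /** Prefix syntax: "+a +b". */
    static Node parsePlus(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.startsWith("+")) {
                throw new IllegalArgumentException("expected '+' before: " + token);
            }
            terms.add(token.substring(1));
        }
        return and(terms.toArray(new String[0]));
    }

    /** Function syntax: "AND(a,b)". */
    static Node parseFunction(String text) {
        String t = text.trim();
        if (!t.startsWith("AND(") || !t.endsWith(")")) {
            throw new IllegalArgumentException("expected AND(...): " + t);
        }
        return and(t.substring(4, t.length() - 1).split("\\s*,\\s*"));
    }

    /** Shared step: every syntax ends up building the same AND node over term leaves. */
    private static Node and(String[] terms) {
        Node[] leaves = new Node[terms.length];
        for (int i = 0; i < terms.length; i++) {
            leaves[i] = new Node(terms[i]);
        }
        return new Node("AND", leaves);
    }
}
```

Because each front end returns the same tree, anything that consumes the tree (term normalization, Query construction) would not need to know which syntax the user typed.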

The query parser has three layers and its core is what we call the
QueryNodeTree. It is a tree that initially represents the syntax of the
original query, e.g. for 'a AND b':
    AND
   /   \
  A     B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer which simply transforms the
query text string into a QueryNodeTree. Currently our implementations of
this layer use javacc.
2. The query node processors do most of the work. This layer is in fact a
configurable chain of processors. Each processor can walk the tree and
modify nodes or even the tree's structure. That makes it possible to
e.g. do query optimization before the query is executed or to tokenize
terms.
3. The third layer is also a configurable chain of builders, which
transform the QueryNodeTree into Lucene Query objects.
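
The three layers can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the contributed implementation: QNode, Processor and PipelineSketch are invented names, the parser only understands 'x AND y', and the builder emits a '+a +b' string as a stand-in for real Lucene Query objects.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical tree node: either an AND operator with children or a term leaf. */
final class QNode {
    final String op;     // "AND" for an operator node, null for a leaf
    final String term;   // term text for a leaf, null for an operator node
    final List<QNode> children = new ArrayList<>();

    private QNode(String op, String term) { this.op = op; this.term = term; }

    static QNode term(String text) { return new QNode(null, text); }

    static QNode and(List<QNode> kids) {
        QNode node = new QNode("AND", null);
        node.children.addAll(kids);
        return node;
    }
}

/** Layer 2 contract (assumed): take a tree, return a possibly restructured tree. */
interface Processor {
    QNode process(QNode tree);
}

final class PipelineSketch {
    /** Layer 1: query text to tree. */
    static QNode parse(String text) {
        List<QNode> kids = new ArrayList<>();
        for (String t : text.trim().split("\\s+AND\\s+")) kids.add(QNode.term(t));
        return kids.size() == 1 ? kids.get(0) : QNode.and(kids);
    }

    /** Processor that modifies nodes: lowercases terms (stand-in for analysis). */
    static QNode lowercase(QNode tree) {
        if (tree.op == null) return QNode.term(tree.term.toLowerCase(Locale.ROOT));
        List<QNode> kids = new ArrayList<>();
        for (QNode child : tree.children) kids.add(lowercase(child));
        return QNode.and(kids);
    }

    /** Processor that changes structure: drops repeated terms under an AND. */
    static QNode dedupe(QNode tree) {
        if (tree.op == null) return tree;
        List<QNode> kids = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (QNode child : tree.children) {
            if (child.op != null || seen.add(child.term)) kids.add(dedupe(child));
        }
        return kids.size() == 1 ? kids.get(0) : QNode.and(kids);
    }

    /** Layer 3: tree to output; a string here instead of a Lucene Query. */
    static String build(QNode tree) {
        if (tree.op == null) return tree.term;
        StringBuilder sb = new StringBuilder();
        for (QNode child : tree.children) {
            if (sb.length() > 0) sb.append(' ');
            String inner = build(child);
            sb.append('+').append(child.op == null ? inner : "(" + inner + ")");
        }
        return sb.toString();
    }

    /** Runs parse, then each processor in order, then build. */
    static String run(String text, Processor... chain) {
        QNode tree = parse(text);
        for (Processor p : chain) tree = p.process(tree);
        return build(tree);
    }
}
```

In this sketch, `PipelineSketch.run("Foo AND foo AND Bar", PipelineSketch::lowercase, PipelineSketch::dedupe)` yields `+foo +bar`, while swapping the two processors yields `+foo +foo +bar`. That difference is one reason a configurable, ordered chain is useful.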

Furthermore, the query parser uses flexible configuration objects, which
are based on AttributeSource/Attribute. It also uses message classes that
allow resource bundles to be attached. This makes it possible to translate
messages, which is an important feature of a query parser.
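
To make the resource-bundle idea concrete, here is a small sketch using only the JDK. It assumes a message object that stores a bundle key plus arguments and is turned into text only when displayed; ParserMessage, the UNEXPECTED_TOKEN key and the two bundles are invented for illustration and do not reflect the actual message classes in the patch.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

/** Hypothetical message: a bundle key plus arguments, resolved only on display. */
final class ParserMessage {
    final String key;
    final Object[] args;

    ParserMessage(String key, Object... args) {
        this.key = key;
        this.args = args;
    }

    /** Looks up the pattern for the key and fills in the arguments. */
    String localized(ResourceBundle bundle, Locale locale) {
        return new MessageFormat(bundle.getString(key), locale).format(args);
    }
}

/** English texts (illustrative). */
final class Messages_en extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"UNEXPECTED_TOKEN", "Unexpected token {0} at column {1}"}
        };
    }
}

/** German texts (illustrative). */
final class Messages_de extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        return new Object[][] {
            {"UNEXPECTED_TOKEN", "Unerwartetes Token {0} in Spalte {1}"}
        };
    }
}
```

The parser would create `new ParserMessage("UNEXPECTED_TOKEN", ")", 7)` once, and the caller chooses the bundle, so the same error can be rendered in English or German without touching the parsing code.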

This design allows us to develop different query syntaxes very quickly.
Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
underlying processors and builders in a few days. We now have a 100%
compatible Lucene query parser, which means the syntax is identical and
all query parser test cases pass on the new one too using a wrapper.

Recent posts show that there is demand for query syntax improvements,
e.g. improved range query syntax or operator precedence. There are
already different QP implementations in Lucene+contrib; however, I think
we did not keep them all up to date and in sync. This is not too
surprising, because usually when fixes and changes are made to the main
query parser, people don't make the corresponding changes in the contrib
parsers. (I'm guilty here too.)

With this new architecture it will be much easier to maintain different
query syntaxes, as the first layer requires very little code.
All syntaxes would benefit from patches and improvements we make to the
underlying layers, which will make supporting different syntaxes much
more manageable.

Attachments

lucene_1567_adriano_crestani_07_13_2009.patch, 14/Jul/09 00:22, 882 kB, Adriano Crestani
lucene_trunk_FlexQueryParser_2009July09_v4.patch, 10/Jul/09 05:38, 738 kB, Luis Alves
lucene_trunk_FlexQueryParser_2009July10_v5.patch, 11/Jul/09 01:40, 738 kB, Luis Alves
lucene_trunk_FlexQueryParser_2009july15_v6.patch, 15/Jul/09 14:46, 848 kB, Luis Alves
lucene_trunk_FlexQueryParser_2009july16_v7.patch, 16/Jul/09 09:55, 870 kB, Adriano Crestani
lucene_trunk_FlexQueryParser_2009july23_v8.patch, 24/Jul/09 05:32, 837 kB, Adriano Crestani
lucene_trunk_FlexQueryParser_2009july27_v9.patch, 28/Jul/09 04:47, 862 kB, Luis Alves
lucene_trunk_FlexQueryParser_2009july28_v10.patch, 29/Jul/09 04:50, 900 kB, Adriano Crestani
lucene_trunk_FlexQueryParser_2009july30_v12.patch, 30/Jul/09 07:48, 907 kB, Adriano Crestani
lucene_trunk_FlexQueryParser_2009july31_v14.patch, 01/Aug/09 00:15, 809 kB, Luis Alves
lucene_trunk_FlexQueryParser_2009March24.patch, 24/Mar/09 09:05, 894 kB, Luis Alves
lucene_trunk_FlexQueryParser_2009March26_v3.patch, 27/Mar/09 00:07, 660 kB, Luis Alves
lucene-1567.patch, 27/Jul/09 23:36, 855 kB, Michael Busch
new_query_parser_src.tar, 06/Jul/09 05:16, 610 kB, Michael Busch
QueryParser_restructure_meetup_june2009_v2.pdf, 06/Jun/09 00:54, 76 kB, Luis Alves
wiki_switching_to_the_new_query_parser.txt, 17/Jul/09 23:54, 2 kB, Adriano Crestani

Issue Links

blocks
LUCENE-1768 NumericRange support for new query parser (Closed)
LUCENE-1486 Wildcards, ORs etc inside Phrase queries (Closed)

is depended upon by
LUCENE-1782 Rename OriginalQueryParserHelper (Closed)

relates to
LUCENE-1852 Fix remaining localization test failures in lucene (Reopened)
LUCENE-1836 Flexible QueryParser fails with local different from en_US (Closed)
LUCENE-588 Escaped wildcard character in wildcard term not handled correctly (Closed)
LUCENE-1792 new QueryParser fails to set AUTO REWRITE for multi-term queries (Closed)
LUCENE-1797 new QueryParser over-increment position for MultiPhraseQuery (Closed)
LUCENE-1823 QueryParser with new features for Lucene 3 (Open)
LUCENE-1820 WildcardQueryNode to expose the positions of the wildcard characters, for easier use in processors and builders (Open)
LUCENE-995 Add open ended range query syntax to QueryParser (Closed)
LUCENE-1829 'ant javacc' in root project should also properly create contrib/queryparser Java files (Closed)
