[LUCENE-1606] Automaton Query/Filter (scalable regex) - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.0-ALPHA
Component/s: core/search
Labels: None
Lucene Fields: New, Patch Available

Description

Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable).

While the out-of-the-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, taking 2 minutes or so. Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend on a constant prefix, and runs the same query in 640ms.

Some use cases I envision:
1. lexicography and similar work on large text corpora
2. looking for things such as URLs where the prefix is not constant (http:// or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

1. Look at the portion that is OK (did not enter a reject state in the DFA)
2. Generate the next possible String and seek to that.
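
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: a `TreeSet<String>` stands in for the index's sorted term dictionary (with `ceiling` playing the role of a seek), the DFA is hand-built for the pattern `[bc]at(s)?` rather than produced by BRICS, and every name here (`SeekingEnumSketch`, `run`, `nextSeekTarget`, `matchingTerms`) is hypothetical. It also assumes the DFA has no dead states, so a missing transition means rejection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

class SeekingEnumSketch {
    // Hand-built DFA for [bc]at(s)? (an assumption for illustration).
    // State 0 is the start state; a missing transition means "reject".
    static final List<TreeMap<Character, Integer>> TRANS = new ArrayList<>();
    static final boolean[] ACCEPT = {false, false, false, true, true};
    static {
        for (int i = 0; i < ACCEPT.length; i++) TRANS.add(new TreeMap<>());
        TRANS.get(0).put('b', 1);
        TRANS.get(0).put('c', 1);
        TRANS.get(1).put('a', 2);
        TRANS.get(2).put('t', 3);
        TRANS.get(3).put('s', 4);
    }

    // Step 1: find the portion of the term that is OK. Returns the states
    // visited; (length - 1) is the number of characters the DFA consumed.
    static int[] run(String term) {
        int[] states = new int[term.length() + 1]; // states[0] == 0, the start state
        int n = 0;
        while (n < term.length()) {
            Integer next = TRANS.get(states[n]).get(term.charAt(n));
            if (next == null) break;
            states[++n] = next;
        }
        return Arrays.copyOf(states, n + 1);
    }

    // Step 2: a lower bound for the next string (greater than term) that the
    // DFA could still accept, or null if no such string exists.
    static String nextSeekTarget(String term, int[] path) {
        int n = path.length - 1;
        // The whole term is a valid prefix: try extending it with the smallest label.
        if (n == term.length() && !TRANS.get(path[n]).isEmpty()) {
            return term + TRANS.get(path[n]).firstKey();
        }
        // Otherwise backtrack: replace the rightmost character that has a larger label.
        for (int i = Math.min(n, term.length() - 1); i >= 0; i--) {
            Character c = TRANS.get(path[i]).higherKey(term.charAt(i));
            if (c != null) return term.substring(0, i) + c;
        }
        return null;
    }

    // Enumerate matches by seeking instead of scanning every term.
    static List<String> matchingTerms(TreeSet<String> dict) {
        List<String> hits = new ArrayList<>();
        String term = dict.isEmpty() ? null : dict.first();
        while (term != null) {
            int[] path = run(term);
            boolean accepted = path.length - 1 == term.length() && ACCEPT[path[path.length - 1]];
            if (accepted) hits.add(term);
            String target = nextSeekTarget(term, path);
            term = (target == null) ? null : dict.ceiling(target); // "seek"
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>(Arrays.asList(
            "apple", "bat", "bath", "bats", "bay", "cab", "cat", "cats", "cattle", "dog", "zebra"));
        System.out.println(matchingTerms(dict)); // prints [bat, bats, cat, cats]
    }
}
```

In this toy dictionary the loop never looks at "bath", "bay", "cattle", "dog" or "zebra": each rejection or match produces a seek target that jumps past them, which is why the approach does not need a constant prefix.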

The Query simply wraps the filter with ConstantScoreQuery.

I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

Attachments

LUCENE-1606-flex.patch | 08/Dec/09 10:50 | 211 kB | Uwe Schindler
LUCENE-1606-flex.patch | 07/Dec/09 22:20 | 216 kB | Robert Muir
LUCENE-1606-flex.patch | 07/Dec/09 18:42 | 212 kB | Robert Muir
LUCENE-1606-flex.patch | 07/Dec/09 14:34 | 212 kB | Robert Muir
LUCENE-1606-flex.patch | 06/Dec/09 22:47 | 213 kB | Robert Muir
LUCENE-1606-flex.patch | 06/Dec/09 22:29 | 213 kB | Robert Muir
LUCENE-1606.patch | 06/Dec/09 03:02 | 213 kB | Robert Muir
LUCENE-1606.patch | 06/Dec/09 02:43 | 213 kB | Robert Muir
LUCENE-1606-flex.patch | 05/Dec/09 20:59 | 230 kB | Uwe Schindler
LUCENE-1606-flex.patch | 05/Dec/09 20:19 | 276 kB | Uwe Schindler
LUCENE-1606-flex.patch | 05/Dec/09 15:15 | 276 kB | Uwe Schindler
LUCENE-1606-flex.patch | 04/Dec/09 18:12 | 234 kB | Robert Muir
LUCENE-1606.patch | 04/Dec/09 14:51 | 214 kB | Robert Muir
LUCENE-1606.patch | 02/Dec/09 22:20 | 204 kB | Robert Muir
LUCENE-1606.patch | 24/Nov/09 22:20 | 211 kB | Robert Muir
LUCENE-1606.patch | 24/Nov/09 18:45 | 211 kB | Robert Muir
LUCENE-1606.patch | 24/Nov/09 12:26 | 208 kB | Robert Muir
LUCENE-1606-flex.patch | 22/Nov/09 23:37 | 212 kB | Robert Muir
LUCENE-1606-flex.patch | 22/Nov/09 18:17 | 197 kB | Michael McCandless
BenchWildcard.java | 21/Nov/09 18:20 | 4 kB | Robert Muir
LUCENE-1606.patch | 21/Nov/09 17:41 | 198 kB | Robert Muir
LUCENE-1606.patch | 21/Nov/09 15:22 | 198 kB | Robert Muir
LUCENE-1606.patch | 21/Nov/09 12:38 | 200 kB | Uwe Schindler
LUCENE-1606.patch | 21/Nov/09 12:25 | 199 kB | Uwe Schindler
LUCENE-1606.patch | 21/Nov/09 00:55 | 198 kB | Uwe Schindler
LUCENE-1606.patch | 20/Nov/09 19:20 | 192 kB | Robert Muir
LUCENE-1606_nodep.patch | 20/Nov/09 16:18 | 194 kB | Robert Muir
LUCENE-1606.patch | 13/Oct/09 18:17 | 58 kB | Robert Muir
LUCENE-1606.patch | 28/Apr/09 18:55 | 47 kB | Robert Muir
automatonmultiqueryfuzzy.patch | 19/Apr/09 21:36 | 47 kB | Robert Muir
automatonMultiQuerySmart.patch | 19/Apr/09 05:43 | 35 kB | Robert Muir
automatonMultiQuery.patch | 18/Apr/09 00:24 | 34 kB | Robert Muir
automatonWithWildCard2.patch | 16/Apr/09 11:43 | 36 kB | Robert Muir
automatonWithWildCard.patch | 16/Apr/09 11:24 | 36 kB | Robert Muir
automaton.patch | 16/Apr/09 08:48 | 19 kB | Robert Muir

Issue Links

depends upon: LUCENE-2111 Wrapup flexible indexing (Closed)
is depended upon by: LUCENE-2090 convert automaton to char[] based processing and TermRef / TermsEnum api (Closed)
is related to: LUCENE-2110 Refactoring of FilteredTermsEnum and MultiTermQuery (Closed)

People

Assignee: Robert Muir

Reporter: Robert Muir

Votes: 0

Watchers: 3

Dates

Created: 16/Apr/09 08:47

Updated: 28/Aug/22 11:59

Resolved: 09/Dec/09 17:46