Lucene - Core
  1. Lucene - Core
  2. LUCENE-1606

Automaton Query/Filter (scalable regex)

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).

      Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.

      Some use cases I envision:
      1. lexicography/etc on large text corpora
      2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

      The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

      The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

      1. Look at the portion that is OK (did not enter a reject state in the DFA)
      2. Generate the next possible String and seek to that.

      the Query simply wraps the filter with ConstantScoreQuery.

      I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

      1. automaton.patch
        19 kB
        Robert Muir
      2. automatonWithWildCard.patch
        36 kB
        Robert Muir
      3. automatonWithWildCard2.patch
        36 kB
        Robert Muir
      4. automatonMultiQuery.patch
        34 kB
        Robert Muir
      5. automatonMultiQuerySmart.patch
        35 kB
        Robert Muir
      6. automatonmultiqueryfuzzy.patch
        47 kB
        Robert Muir
      7. LUCENE-1606.patch
        47 kB
        Robert Muir
      8. LUCENE-1606.patch
        58 kB
        Robert Muir
      9. LUCENE-1606_nodep.patch
        194 kB
        Robert Muir
      10. LUCENE-1606.patch
        192 kB
        Robert Muir
      11. LUCENE-1606.patch
        198 kB
        Uwe Schindler
      12. LUCENE-1606.patch
        199 kB
        Uwe Schindler
      13. LUCENE-1606.patch
        200 kB
        Uwe Schindler
      14. LUCENE-1606.patch
        198 kB
        Robert Muir
      15. LUCENE-1606.patch
        198 kB
        Robert Muir
      16. BenchWildcard.java
        4 kB
        Robert Muir
      17. LUCENE-1606-flex.patch
        197 kB
        Michael McCandless
      18. LUCENE-1606-flex.patch
        212 kB
        Robert Muir
      19. LUCENE-1606.patch
        208 kB
        Robert Muir
      20. LUCENE-1606.patch
        211 kB
        Robert Muir
      21. LUCENE-1606.patch
        211 kB
        Robert Muir
      22. LUCENE-1606.patch
        204 kB
        Robert Muir
      23. LUCENE-1606.patch
        214 kB
        Robert Muir
      24. LUCENE-1606-flex.patch
        234 kB
        Robert Muir
      25. LUCENE-1606-flex.patch
        276 kB
        Uwe Schindler
      26. LUCENE-1606-flex.patch
        276 kB
        Uwe Schindler
      27. LUCENE-1606-flex.patch
        230 kB
        Uwe Schindler
      28. LUCENE-1606.patch
        213 kB
        Robert Muir
      29. LUCENE-1606.patch
        213 kB
        Robert Muir
      30. LUCENE-1606-flex.patch
        213 kB
        Robert Muir
      31. LUCENE-1606-flex.patch
        213 kB
        Robert Muir
      32. LUCENE-1606-flex.patch
        212 kB
        Robert Muir
      33. LUCENE-1606-flex.patch
        212 kB
        Robert Muir
      34. LUCENE-1606-flex.patch
        216 kB
        Robert Muir
      35. LUCENE-1606-flex.patch
        211 kB
        Uwe Schindler

        Issue Links

          Activity

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development