Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1606

Automaton Query/Filter (scalable regex)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).

      Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.

      Some use cases I envision:
      1. lexicography/etc on large text corpora
      2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

      The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

      The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

      1. Look at the portion that is OK (did not enter a reject state in the DFA)
      2. Generate the next possible String and seek to that.

      the Query simply wraps the filter with ConstantScoreQuery.

      I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

        Attachments

        1. automaton.patch
          19 kB
          Robert Muir
        2. automatonMultiQuery.patch
          34 kB
          Robert Muir
        3. automatonmultiqueryfuzzy.patch
          47 kB
          Robert Muir
        4. automatonMultiQuerySmart.patch
          35 kB
          Robert Muir
        5. automatonWithWildCard.patch
          36 kB
          Robert Muir
        6. automatonWithWildCard2.patch
          36 kB
          Robert Muir
        7. BenchWildcard.java
          4 kB
          Robert Muir
        8. LUCENE-1606_nodep.patch
          194 kB
          Robert Muir
        9. LUCENE-1606.patch
          213 kB
          Robert Muir
        10. LUCENE-1606.patch
          213 kB
          Robert Muir
        11. LUCENE-1606.patch
          214 kB
          Robert Muir
        12. LUCENE-1606.patch
          204 kB
          Robert Muir
        13. LUCENE-1606.patch
          211 kB
          Robert Muir
        14. LUCENE-1606.patch
          211 kB
          Robert Muir
        15. LUCENE-1606.patch
          208 kB
          Robert Muir
        16. LUCENE-1606.patch
          198 kB
          Robert Muir
        17. LUCENE-1606.patch
          198 kB
          Robert Muir
        18. LUCENE-1606.patch
          200 kB
          Uwe Schindler
        19. LUCENE-1606.patch
          199 kB
          Uwe Schindler
        20. LUCENE-1606.patch
          198 kB
          Uwe Schindler
        21. LUCENE-1606.patch
          192 kB
          Robert Muir
        22. LUCENE-1606.patch
          58 kB
          Robert Muir
        23. LUCENE-1606.patch
          47 kB
          Robert Muir
        24. LUCENE-1606-flex.patch
          211 kB
          Uwe Schindler
        25. LUCENE-1606-flex.patch
          216 kB
          Robert Muir
        26. LUCENE-1606-flex.patch
          212 kB
          Robert Muir
        27. LUCENE-1606-flex.patch
          212 kB
          Robert Muir
        28. LUCENE-1606-flex.patch
          213 kB
          Robert Muir
        29. LUCENE-1606-flex.patch
          213 kB
          Robert Muir
        30. LUCENE-1606-flex.patch
          230 kB
          Uwe Schindler
        31. LUCENE-1606-flex.patch
          276 kB
          Uwe Schindler
        32. LUCENE-1606-flex.patch
          276 kB
          Uwe Schindler
        33. LUCENE-1606-flex.patch
          234 kB
          Robert Muir
        34. LUCENE-1606-flex.patch
          212 kB
          Robert Muir
        35. LUCENE-1606-flex.patch
          197 kB
          Michael McCandless

          Issue Links

            Activity

              People

              • Assignee:
                rcmuir Robert Muir
                Reporter:
                rcmuir Robert Muir
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: