Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.10.1
    • Fix Version/s: 4.10.3, 5.0, 6.0
    • Component/s: core/queryparser
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When creating an automaton from an org.apache.lucene.util.automaton.RegExp, it's possible for the automaton to use so much memory it exceeds the maximum array size for java.

      The following caused an OutOfMemoryError with a 32gb heap:

      new RegExp("\\[\\[(Datei|File|Bild|Image):[^]]*alt=[^]|}]{50,200}").toAutomaton();
      

      When increased to a 60gb heap, the following exception is thrown:

        1> java.lang.IllegalArgumentException: requested array size 2147483624 exceeds maximum array in java (2147483623)
        1>     __randomizedtesting.SeedInfo.seed([7BE81EF678615C32:95C8057A4ABA5B52]:0)
        1>     org.apache.lucene.util.ArrayUtil.oversize(ArrayUtil.java:168)
        1>     org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:295)
        1>     org.apache.lucene.util.automaton.Automaton$Builder.addTransition(Automaton.java:639)
        1>     org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741)
        1>     org.apache.lucene.util.automaton.MinimizationOperations.minimizeHopcroft(MinimizationOperations.java:62)
        1>     org.apache.lucene.util.automaton.MinimizationOperations.minimize(MinimizationOperations.java:51)
        1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:477)
        1>     org.apache.lucene.util.automaton.RegExp.toAutomaton(RegExp.java:426)
      
      1. LUCENE-6046.patch
        90 kB
        Nik Everett
      2. LUCENE-6046.patch
        63 kB
        Michael McCandless
      3. LUCENE-6046.patch
        71 kB
        Nik Everett

        Activity

        Hide
        Michael McCandless added a comment -

        Hmm, two bugs here.

        First off, RegExp.toAutomaton is an inherently costly method: wasteful of RAM and CPU, doing minimize after each recursive operation, in order to build a DFA in the end. It's unfortunately quite easy to concoct regular expressions that make it consume ridiculous resources. I'll look at this example and see if we can improve it, but in the end it will always have its "adversarial cases" unless we give up on making the resulting automaton deterministic, which would be a very big change.

        Maybe we should add adversary defenses to it, e.g. you set a limit on the number of states it's allowed to create, and it throws a RegExpTooHardException if it would exceed that?

        Second off, ArrayUtil.oversize has the wrong (too large) value for MAX_ARRAY_LENGTH, which is a bug from LUCENE-5844. Which JVM did you run this test on?

        Show
        Michael McCandless added a comment - Hmm, two bugs here. First off, RegExp.toAutomaton is an inherently costly method: wasteful of RAM and CPU, doing minimize after each recursive operation, in order to build a DFA in the end. It's unfortunately quite easy to concoct regular expressions that make it consume ridiculous resources. I'll look at this example and see if we can improve it, but in the end it will always have its "adversarial cases" unless we give up on making the resulting automaton deterministic, which would be a very big change. Maybe we should add adversary defenses to it, e.g. you set a limit on the number of states it's allowed to create, and it throws a RegExpTooHardException if it would exceed that? Second off, ArrayUtil.oversize has the wrong (too large) value for MAX_ARRAY_LENGTH, which is a bug from LUCENE-5844 . Which JVM did you run this test on?
        Hide
        Dawid Weiss added a comment -

        Just a note – Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners.
        http://swtch.com/~rsc/regexp/regexp1.html

        (There is no clear winner – both DFAs and NFA have advantages.)

        Show
        Dawid Weiss added a comment - Just a note – Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners. http://swtch.com/~rsc/regexp/regexp1.html (There is no clear winner – both DFAs and NFA have advantages.)
        Hide
        Lee Hinman added a comment -

        Michael McCandless I ran it with the following JVM:

        java version "1.8.0_20"
        Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
        Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
        
        Show
        Lee Hinman added a comment - Michael McCandless I ran it with the following JVM: java version "1.8.0_20" Java(TM) SE Runtime Environment (build 1.8.0_20-b26) Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
        Hide
        Michael McCandless added a comment -

        Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners.

        Thanks Dawid, these are great.

        Switching to NFA based matching would be a very large change ... I don't think we should pursue it here. Terms.intersect implementation for block tree is already very complex ... though I suppose of we could hide the "on the fly subset construction" (and convert regexp to a Thompson NFA) under an API, then Terms.intersect implementation wouldn't have to change much.

        Still, there will always be adversarial cases no matter which approach we choose. I think for this issue we should allow passing in a "how much work are you willing to do" to RegExp.toAutomaton, and it throws an exc when it would exceed that.

        Show
        Michael McCandless added a comment - Russ Cox wrote a series of excellent articles about different approaches of implementing regexp scanners. Thanks Dawid, these are great. Switching to NFA based matching would be a very large change ... I don't think we should pursue it here. Terms.intersect implementation for block tree is already very complex ... though I suppose of we could hide the "on the fly subset construction" (and convert regexp to a Thompson NFA) under an API, then Terms.intersect implementation wouldn't have to change much. Still, there will always be adversarial cases no matter which approach we choose. I think for this issue we should allow passing in a "how much work are you willing to do" to RegExp.toAutomaton, and it throws an exc when it would exceed that.
        Hide
        Michael McCandless added a comment -

        Michael McCandless I ran it with the following JVM:

        Thanks Lee Hinman.

        I was wrong about the first bug: there is no bug in ArrayUtil.oversize. That exception just means RegExp is trying to create a too-big array ... so just the one bug

        Show
        Michael McCandless added a comment - Michael McCandless I ran it with the following JVM: Thanks Lee Hinman . I was wrong about the first bug: there is no bug in ArrayUtil.oversize. That exception just means RegExp is trying to create a too-big array ... so just the one bug
        Hide
        Dawid Weiss added a comment -

        I didn't mean to imply we should change the regexp implementation! This was just a pointer in case somebody wished to understand why regexps can explode. I actually wish there was an NFA-based regexp implementation for Java (with low-memory footprint) because this would make concatenating thousands of regexps (e.g., for pattern detection) much easier.

        Show
        Dawid Weiss added a comment - I didn't mean to imply we should change the regexp implementation! This was just a pointer in case somebody wished to understand why regexps can explode. I actually wish there was an NFA-based regexp implementation for Java (with low-memory footprint) because this would make concatenating thousands of regexps (e.g., for pattern detection) much easier.
        Hide
        Lee Hinman added a comment -

        I think for this issue we should allow passing in a "how much work are you willing to do" to RegExp.toAutomaton, and it throws an exc when it would exceed that.

        For what it's worth, I think this would be a good solution for us, much better than silently (from the user's perspective) freezing and then hitting an OOME.

        Show
        Lee Hinman added a comment - I think for this issue we should allow passing in a "how much work are you willing to do" to RegExp.toAutomaton, and it throws an exc when it would exceed that. For what it's worth, I think this would be a good solution for us, much better than silently (from the user's perspective) freezing and then hitting an OOME.
        Hide
        Nik Everett added a comment -

        I'm working on a first cut of something that does that. Better regex implementation would be great but the biggest thing to me is being able to limit the amount of work the determinize operation performs. Its such a costly operation that I don't think it should ever be really abstracted from the user. Something like having determinize throw a checked exception when it performs too much work would make you keenly aware whenever you might be straying into exponential territory.

        Show
        Nik Everett added a comment - I'm working on a first cut of something that does that. Better regex implementation would be great but the biggest thing to me is being able to limit the amount of work the determinize operation performs. Its such a costly operation that I don't think it should ever be really abstracted from the user. Something like having determinize throw a checked exception when it performs too much work would make you keenly aware whenever you might be straying into exponential territory.
        Hide
        Michael McCandless added a comment -

        OK I boiled down the adversarial regexp to this simpler still-adversarial version: [ac]*a[ac]{50,200}

        I suspect this is a legitimate adversary and not a bug in our RegExp/automaton impl, i.e. the number of states in the DFA for this is exponential as a function of the 50/200.

        Show
        Michael McCandless added a comment - OK I boiled down the adversarial regexp to this simpler still-adversarial version: [ac]*a[ac]{50,200} I suspect this is a legitimate adversary and not a bug in our RegExp/automaton impl, i.e. the number of states in the DFA for this is exponential as a function of the 50/200.
        Hide
        Nik Everett added a comment -

        Oh yeah, its totally running into 2^n territory legitiately here. This is totally something that'd be rejected by a framework to prevent explosive growth during determination.

        Show
        Nik Everett added a comment - Oh yeah, its totally running into 2^n territory legitiately here. This is totally something that'd be rejected by a framework to prevent explosive growth during determination.
        Hide
        Nik Everett added a comment -

        First cut at a patch. Adds maxDeterminizedStates to Operations.determinize and pipes it through to tons of places. I think its important never to hide when determinize is called because of how potentially heavy it is. Forcing callers of MinimizationOperations.minimize, Operations.reverse, Operations.minus etc to specify maxDeterminizedStates makes it pretty clear that the automaton might be determinized during those processes.

        I added an unchecked exception for when the Automaton can't be determinized within the specified number of state but I'm really tempted to change it to a checked exception to make it super duper obvious when determinization might occur.

        Show
        Nik Everett added a comment - First cut at a patch. Adds maxDeterminizedStates to Operations.determinize and pipes it through to tons of places. I think its important never to hide when determinize is called because of how potentially heavy it is. Forcing callers of MinimizationOperations.minimize, Operations.reverse, Operations.minus etc to specify maxDeterminizedStates makes it pretty clear that the automaton might be determinized during those processes. I added an unchecked exception for when the Automaton can't be determinized within the specified number of state but I'm really tempted to change it to a checked exception to make it super duper obvious when determinization might occur.
        Hide
        Nik Everett added a comment -

        Oh - I'm still running the solr tests against this. I imagine they'll pass as they've been running fine for 30 minutes now but I should throw that out there in case someone gets them to fail with this before I do.

        Show
        Nik Everett added a comment - Oh - I'm still running the solr tests against this. I imagine they'll pass as they've been running fine for 30 minutes now but I should throw that out there in case someone gets them to fail with this before I do.
        Hide
        Michael McCandless added a comment -

        Patch, tests pass. I added a required "int maxStates" to
        RegExp.toAutomaton, and it threads this down to determinize, and
        throws RegExpTooHardExc if determinize would need to exceed that
        limit.

        I didn't make it a checked exc; I had started that way but it
        percolates up high, e.g. into query parsers, and I think that's too
        much. The exception message itself should make it quite clear what
        went wrong at query time.

        I also added this as an optional param to RegexpQuery default ctor,
        and defaulted it to 10000 states, and to QueryParserBase, with the same
        default.

        Show
        Michael McCandless added a comment - Patch, tests pass. I added a required "int maxStates" to RegExp.toAutomaton, and it threads this down to determinize, and throws RegExpTooHardExc if determinize would need to exceed that limit. I didn't make it a checked exc; I had started that way but it percolates up high, e.g. into query parsers, and I think that's too much. The exception message itself should make it quite clear what went wrong at query time. I also added this as an optional param to RegexpQuery default ctor, and defaulted it to 10000 states, and to QueryParserBase, with the same default.
        Hide
        Nik Everett added a comment -

        Oh no! I wrote a very similar patch! Sorry to duplicate effort there.

        I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. Its pretty darn huge but it does get the job done.

        Show
        Nik Everett added a comment - Oh no! I wrote a very similar patch! Sorry to duplicate effort there. I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. Its pretty darn huge but it does get the job done.
        Hide
        Michael McCandless added a comment -

        Woops, sorry, I didn't see you had a patch here! Thank you.

        I like your patch: it's good to make all hidden usages of determinize visible. Let's start from your patch and merge anything from mine in? E.g. I think we can collapse minimizeHopcroft into just minimize...

        I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. Its pretty darn huge but it does get the job done.

        Whoa, which tests needed 1M max states? I worry about passing a 1M state automaton to minimize...

        Show
        Michael McCandless added a comment - Woops, sorry, I didn't see you had a patch here! Thank you. I like your patch: it's good to make all hidden usages of determinize visible. Let's start from your patch and merge anything from mine in? E.g. I think we can collapse minimizeHopcroft into just minimize... I found that 10,000 states wasn't quite enough to handle some of the tests so I went with 1,000,000 as the default. Its pretty darn huge but it does get the job done. Whoa, which tests needed 1M max states? I worry about passing a 1M state automaton to minimize...
        Hide
        Michael McCandless added a comment -

        I like the test simplifications, and removing dead code from Operations.determinize.

        Can we fix the exc thrown from Regexp to include the offending regular expression, and fix the test to confirm the message contains it? Maybe also add RegExp.toStringTree? I found it very useful while debugging the original regexp

        I think QueryParserBase should also have a set/get for this option?

        Show
        Michael McCandless added a comment - I like the test simplifications, and removing dead code from Operations.determinize. Can we fix the exc thrown from Regexp to include the offending regular expression, and fix the test to confirm the message contains it? Maybe also add RegExp.toStringTree? I found it very useful while debugging the original regexp I think QueryParserBase should also have a set/get for this option?
        Hide
        Nik Everett added a comment -

        TestDeterminizeLexicon wants to make an automata that accepts 5000 random strings. So 10,000 isn't enough states for it. I'll drop the default limit to 10,000 again and just feed a million to that test case.

        Show
        Nik Everett added a comment - TestDeterminizeLexicon wants to make an automata that accepts 5000 random strings. So 10,000 isn't enough states for it. I'll drop the default limit to 10,000 again and just feed a million to that test case.
        Hide
        Nik Everett added a comment -

        I'll certainly add the regexp string to the exception message. And I'll merge the toStringTree from your patch into mine if you'd like.

        Yeah - QueryParserBase should have this option too.

        The thing I found most useful for debugging this was to call toDot on the automata before and after normalization. I just looked at it and went, oh, of course you have to do it that way. No wonder the states explode. And then I read https://en.wikipedia.org/wiki/Powerset_construction and remembered it from my rusty CS degree.

        Show
        Nik Everett added a comment - I'll certainly add the regexp string to the exception message. And I'll merge the toStringTree from your patch into mine if you'd like. Yeah - QueryParserBase should have this option too. The thing I found most useful for debugging this was to call toDot on the automata before and after normalization. I just looked at it and went, oh, of course you have to do it that way. No wonder the states explode. And then I read https://en.wikipedia.org/wiki/Powerset_construction and remembered it from my rusty CS degree.
        Hide
        Nik Everett added a comment -

        Next version with fixes based on Mike's feedback.

        Show
        Nik Everett added a comment - Next version with fixes based on Mike's feedback.
        Hide
        Nik Everett added a comment -

        A couple of updates:
        This affects version 4.9 as well. Probably all versions. But its impact is likely minor enough to only be worth adding to the 4.10 line.

        A found a few test cases that need lots and lots of states. Any time you feed a couple hundred random unicode words to the automata you'll end up needing more than ten thousand states. I've updated those tests to ask for a million states and they caught a few places where I hadn't been as diligent in piping maxDeterminizedStates through.

        Show
        Nik Everett added a comment - A couple of updates: This affects version 4.9 as well. Probably all versions. But its impact is likely minor enough to only be worth adding to the 4.10 line. A found a few test cases that need lots and lots of states. Any time you feed a couple hundred random unicode words to the automata you'll end up needing more than ten thousand states. I've updated those tests to ask for a million states and they caught a few places where I hadn't been as diligent in piping maxDeterminizedStates through.
        Hide
        Michael McCandless added a comment -

        Thanks Nik, your patch looks great; I'll fold in some more minor things from my patch and commit!

        Show
        Michael McCandless added a comment - Thanks Nik, your patch looks great; I'll fold in some more minor things from my patch and commit!
        Hide
        ASF subversion and git services added a comment -

        Commit 1636716 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1636716 ]

        LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize

        Show
        ASF subversion and git services added a comment - Commit 1636716 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1636716 ] LUCENE-6046 : add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
        Hide
        ASF subversion and git services added a comment -

        Commit 1636728 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1636728 ]

        LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize

        Show
        ASF subversion and git services added a comment - Commit 1636728 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1636728 ] LUCENE-6046 : add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
        Hide
        ASF subversion and git services added a comment -

        Commit 1636758 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1636758 ]

        LUCENE-6046: fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too

        Show
        ASF subversion and git services added a comment - Commit 1636758 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1636758 ] LUCENE-6046 : fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too
        Hide
        ASF subversion and git services added a comment -

        Commit 1636759 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1636759 ]

        LUCENE-6046: fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too

        Show
        ASF subversion and git services added a comment - Commit 1636759 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1636759 ] LUCENE-6046 : fix test failure, add maxDeterminizedStates to AutomatonQuery and WildcardQuery too
        Hide
        ASF subversion and git services added a comment -

        Commit 1636762 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1636762 ]

        LUCENE-6046: add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize

        Show
        ASF subversion and git services added a comment - Commit 1636762 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1636762 ] LUCENE-6046 : add maxDeterminizedStates to determinize to prevent exhausting CPU/RAM when the automaton is too difficult to determinize
        Hide
        Michael McCandless added a comment -

        Thanks Nik!

        Show
        Michael McCandless added a comment - Thanks Nik!
        Hide
        ASF subversion and git services added a comment -

        Commit 1636830 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1636830 ]

        LUCENE-6046: let this test determinize massive automata

        Show
        ASF subversion and git services added a comment - Commit 1636830 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1636830 ] LUCENE-6046 : let this test determinize massive automata
        Hide
        ASF subversion and git services added a comment -

        Commit 1636831 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1636831 ]

        LUCENE-6046: let this test determinize massive automata

        Show
        ASF subversion and git services added a comment - Commit 1636831 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1636831 ] LUCENE-6046 : let this test determinize massive automata
        Hide
        ASF subversion and git services added a comment -

        Commit 1636832 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1636832 ]

        LUCENE-6046: let this test determinize massive automata

        Show
        ASF subversion and git services added a comment - Commit 1636832 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1636832 ] LUCENE-6046 : let this test determinize massive automata
        Hide
        ASF subversion and git services added a comment -

        Commit 1637054 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1637054 ]

        LUCENE-6046: let this test determinize massive automata

        Show
        ASF subversion and git services added a comment - Commit 1637054 from Robert Muir in branch 'dev/trunk' [ https://svn.apache.org/r1637054 ] LUCENE-6046 : let this test determinize massive automata
        Hide
        ASF subversion and git services added a comment -

        Commit 1637055 from Robert Muir in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1637055 ]

        LUCENE-6046: let this test determinize massive automata

        Show
        ASF subversion and git services added a comment - Commit 1637055 from Robert Muir in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1637055 ] LUCENE-6046 : let this test determinize massive automata
        Hide
        ASF subversion and git services added a comment -

        Commit 1637056 from Robert Muir in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1637056 ]

        LUCENE-6046: let this test determinize massive automata

        Show
        ASF subversion and git services added a comment - Commit 1637056 from Robert Muir in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1637056 ] LUCENE-6046 : let this test determinize massive automata
        Hide
        ASF subversion and git services added a comment -

        Commit 1637078 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1637078 ]

        LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish

        Show
        ASF subversion and git services added a comment - Commit 1637078 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1637078 ] LUCENE-6046 : remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish
        Hide
        ASF subversion and git services added a comment -

        Commit 1637080 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1637080 ]

        LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish

        Show
        ASF subversion and git services added a comment - Commit 1637080 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1637080 ] LUCENE-6046 : remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish
        Hide
        ASF subversion and git services added a comment -

        Commit 1637082 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1637082 ]

        LUCENE-6046: remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish

        Show
        ASF subversion and git services added a comment - Commit 1637082 from Michael McCandless in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1637082 ] LUCENE-6046 : remove det state limit for all AutomatonTestUtil.randomAutomaton since they can become biggish
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Lee Hinman
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development