Lucene - Core
  1. Lucene - Core
  2. LUCENE-5628

SpecialOperations.getFiniteStrings should not recurse

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.8.1, 4.9, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today it consumes one Java stack frame per transition, which when used by AnalyzingSuggester is per character in each token. This can lead to stack overflows if you have a long suggestion.

      1. LUCENE-5628.patch
        17 kB
        Michael McCandless
      2. LUCENE-5628.patch
        17 kB
        Michael McCandless
      3. LUCENE-5628.patch
        17 kB
        Michael McCandless
      4. LUCENE-5628.patch
        13 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Patch. The method is more hairy than before ... but I think it's working. I added some more tests for getFiniteStrings.

        Show
        Michael McCandless added a comment - Patch. The method is more hairy than before ... but I think it's working. I added some more tests for getFiniteStrings.
        Hide
        Simon Willnauer added a comment -

        I think it's worth doing the optimization here. A couple of comments

        • can we put the exit condition into the while block instead of at the end with a break I think it can just be while(string.length > 0)
        • looking at the impl of State I think we can just use an identity hashset or maybe even an array since the Ids are within known bounds to check the pathStates? You could even just us a bitset and mark the state ID as visited? Hmm now that I wrote it I see your comment I will leave it here for dicsussion.
        • Somewhat unrelated but I think the State implementation has a problem since it doen't override equlas but it should since it has an hashcode impl. I wonder if we either should remove the hashCode or add equals just for consistency?
        • should we rather throw IllegalState than IllegalArgument
        • just for readability it might be good to s/strings/finiteStrings/ I had a hard time to see when you do things on the string vs. strings
        • is this a leftover ==> // a.getNumberedStates();
        Show
        Simon Willnauer added a comment - I think it's worth doing the optimization here. A couple of comments can we put the exit condition into the while block instead of at the end with a break I think it can just be while(string.length > 0) looking at the impl of State I think we can just use an identity hashset or maybe even an array since the Ids are within known bounds to check the pathStates? You could even just us a bitset and mark the state ID as visited? Hmm now that I wrote it I see your comment I will leave it here for dicsussion. Somewhat unrelated but I think the State implementation has a problem since it doen't override equlas but it should since it has an hashcode impl. I wonder if we either should remove the hashCode or add equals just for consistency? should we rather throw IllegalState than IllegalArgument just for readability it might be good to s/strings/finiteStrings/ I had a hard time to see when you do things on the string vs. strings is this a leftover ==> // a.getNumberedStates();
        Hide
        Robert Muir added a comment -

        Can we reduce the number of lines of code in the new method? It's not even comparable to the current code. How much of the loc is the cycle detection? Given the cost, this may not be worth it. This is expert shit and the user can add assert.isfinite to their code. How much of the loc is code optimization? Can the old code please be added to automatontestutil as slowxxx and compared against the new one with random automata?

        Show
        Robert Muir added a comment - Can we reduce the number of lines of code in the new method? It's not even comparable to the current code. How much of the loc is the cycle detection? Given the cost, this may not be worth it. This is expert shit and the user can add assert.isfinite to their code. How much of the loc is code optimization? Can the old code please be added to automatontestutil as slowxxx and compared against the new one with random automata?
        Hide
        Robert Muir added a comment -

        If we want to baby the user, and in not sure what user we have in mind here, just invoke isfinite. I don't like the code dup nor the precedence that unrelated code needs to deal with this.

        This thing needs to get much shorter and simpler to go in. Can we make it slower to achieve that? I would make it 10 times slower if it removed 1/2 the code... Without hesitation.

        Show
        Robert Muir added a comment - If we want to baby the user, and in not sure what user we have in mind here, just invoke isfinite. I don't like the code dup nor the precedence that unrelated code needs to deal with this. This thing needs to get much shorter and simpler to go in. Can we make it slower to achieve that? I would make it 10 times slower if it removed 1/2 the code... Without hesitation.
        Hide
        Michael McCandless added a comment -

        Good feedback, thanks!

        can we put the exit condition into the while block instead of at the end with a break I think it can just be while(string.length > 0)

        Fixed.

        looking at the impl of State I think we can just use an identity hashset or maybe even an array since the Ids are within known bounds to check the pathStates? You could even just us a bitset and mark the state ID as visited? Hmm now that I wrote it I see your comment I will leave it here for dicsussion.

        I switched to IdentityHashSet.

        Yeah I struggled w/ this, but the original method didn't set the state
        numbers so I didn't want to change that. Setting the numbers does a
        DFS on the automaton...

        Somewhat unrelated but I think the State implementation has a problem since it doen't override equlas but it should since it has an hashcode impl. I wonder if we either should remove the hashCode or add equals just for consistency?

        I removed State.hashCode.

        should we rather throw IllegalState than IllegalArgument

        Hmm, IAE felt right since you passed it an invalid (cyclic) argument?

        just for readability it might be good to s/strings/finiteStrings/ I had a hard time to see when you do things on the string vs. strings

        I changed to results.

        is this a leftover ==> // a.getNumberedStates();

        Removed.

        Show
        Michael McCandless added a comment - Good feedback, thanks! can we put the exit condition into the while block instead of at the end with a break I think it can just be while(string.length > 0) Fixed. looking at the impl of State I think we can just use an identity hashset or maybe even an array since the Ids are within known bounds to check the pathStates? You could even just us a bitset and mark the state ID as visited? Hmm now that I wrote it I see your comment I will leave it here for dicsussion. I switched to IdentityHashSet. Yeah I struggled w/ this, but the original method didn't set the state numbers so I didn't want to change that. Setting the numbers does a DFS on the automaton... Somewhat unrelated but I think the State implementation has a problem since it doen't override equlas but it should since it has an hashcode impl. I wonder if we either should remove the hashCode or add equals just for consistency? I removed State.hashCode. should we rather throw IllegalState than IllegalArgument Hmm, IAE felt right since you passed it an invalid (cyclic) argument? just for readability it might be good to s/strings/finiteStrings/ I had a hard time to see when you do things on the string vs. strings I changed to results. is this a leftover ==> // a.getNumberedStates(); Removed.
        Hide
        Michael McCandless added a comment -

        Can we reduce the number of lines of code in the new method? It's not even comparable to the current code.

        I'll see if I can simplify it somehow ...

        How much of the loc is the cycle detection?

        This is really a miniscule part of it: just look for whoever touches
        the pathStates.

        How much of the loc is code optimization?

        It's not optimization; in fact I imagine this impl is slower.

        Can the old code please be added to automatontestutil as slowxxx and compared against the new one with random automata?

        Great idea, I did that.

        Show
        Michael McCandless added a comment - Can we reduce the number of lines of code in the new method? It's not even comparable to the current code. I'll see if I can simplify it somehow ... How much of the loc is the cycle detection? This is really a miniscule part of it: just look for whoever touches the pathStates. How much of the loc is code optimization? It's not optimization; in fact I imagine this impl is slower. Can the old code please be added to automatontestutil as slowxxx and compared against the new one with random automata? Great idea, I did that.
        Hide
        Michael McCandless added a comment -

        New patch.

        Show
        Michael McCandless added a comment - New patch.
        Hide
        Michael McCandless added a comment -

        New patch, with some simplification: I moved all the hairy logic about next label/transition into the PathNode. I think this helps.

        I put a nocommit to use Stack instead of PathNode[] ... this would be simpler (push/pop instead of .get/.remove) ... the only backside is it would mean new Java object on each push vs now where it re-uses.

        Show
        Michael McCandless added a comment - New patch, with some simplification: I moved all the hairy logic about next label/transition into the PathNode. I think this helps. I put a nocommit to use Stack instead of PathNode[] ... this would be simpler (push/pop instead of .get/.remove) ... the only backside is it would mean new Java object on each push vs now where it re-uses.
        Hide
        Michael McCandless added a comment -

        New patch, just adding some comments and removing the nocommit. I think it's ready ... I plan to commit soon.

        Show
        Michael McCandless added a comment - New patch, just adding some comments and removing the nocommit. I think it's ready ... I plan to commit soon.
        Hide
        Robert Muir added a comment -

        OK i agree this is simpler.

        Can we remove the "a.removeDeadTransitions()" in the test? If this is really needed for the method to work, i think it should call it...

        Show
        Robert Muir added a comment - OK i agree this is simpler. Can we remove the "a.removeDeadTransitions()" in the test? If this is really needed for the method to work, i think it should call it...
        Hide
        ASF subversion and git services added a comment -

        Commit 1592362 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1592362 ]

        LUCENE-5628: change getFiniteStrings to iterative impl

        Show
        ASF subversion and git services added a comment - Commit 1592362 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1592362 ] LUCENE-5628 : change getFiniteStrings to iterative impl
        Hide
        ASF subversion and git services added a comment -

        Commit 1592363 from Michael McCandless in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1592363 ]

        LUCENE-5628: change getFiniteStrings to iterative impl

        Show
        ASF subversion and git services added a comment - Commit 1592363 from Michael McCandless in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1592363 ] LUCENE-5628 : change getFiniteStrings to iterative impl
        Hide
        ASF subversion and git services added a comment -

        Commit 1592584 from Michael McCandless in branch 'dev/branches/lucene_solr_4_8'
        [ https://svn.apache.org/r1592584 ]

        LUCENE-5599, LUCENE-5600, LUCENE-5628: backport to 4.8.x

        Show
        ASF subversion and git services added a comment - Commit 1592584 from Michael McCandless in branch 'dev/branches/lucene_solr_4_8' [ https://svn.apache.org/r1592584 ] LUCENE-5599 , LUCENE-5600 , LUCENE-5628 : backport to 4.8.x

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development