Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8219

LevenshteinAutomata should estimate the number of states and transitions

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4, 8.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently the toAutomaton() method of the LevenshteinAutomata class uses the default constructor of the Automaton although it exactly knows how many states the automaton will have and can do a reasonable estimation on how many transitions it will need as well.

      I suggest changing the lines

      // the number of states is based on the length of the word and n
      int numStates = description.size();
      
      Automaton a = new Automaton();
      int lastState;
      

      to

      // the number of states is based on the length of the word and n
      final int numStates = description.size();
      final int numTransitions = numStates * Math.min(1 + 2 * n, alphabet.length);
      final int prefixStates = prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;
      
      final Automaton a = new Automaton(numStates + prefixStates, numTransitions);
      int lastState;
      

      For my test data this cut down on the total amount of memory needed for int arrays by factor 4. The estimation of "1 + 2 * editDistance" should maybe rather be replaced by a value coming from the ParametricDescription used.

        Attachments

          Activity

            People

            • Assignee:
              kwright@metacarta.com Karl Wright
              Reporter:
              christianz Christian Ziech
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: