Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-8219

LevenshteinAutomata should estimate the number of states and transitions

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 7.4, 8.0
    • None
    • None
    • New

    Description

      Currently the toAutomaton() method of the LevenshteinAutomata class uses the default constructor of the Automaton although it exactly knows how many states the automaton will have and can do a reasonable estimation on how many transitions it will need as well.

      I suggest changing the lines

      // the number of states is based on the length of the word and n
      int numStates = description.size();
      
      Automaton a = new Automaton();
      int lastState;
      

      to

      // the number of states is based on the length of the word and n
      final int numStates = description.size();
      final int numTransitions = numStates * Math.min(1 + 2 * n, alphabet.length);
      final int prefixStates = prefix != null ? prefix.codePointCount(0, prefix.length()) : 0;
      
      final Automaton a = new Automaton(numStates + prefixStates, numTransitions);
      int lastState;
      

      For my test data this cut down on the total amount of memory needed for int arrays by factor 4. The estimation of "1 + 2 * editDistance" should maybe rather be replaced by a value coming from the ParametricDescription used.

      Attachments

        Activity

          People

            kwright@metacarta.com Karl Wright
            christianz Christian Ziech
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: