Uploaded image for project: 'Apache AsterixDB'
  1. Apache AsterixDB
  2. ASTERIXDB-1778

Calculating the edit-distance in SimilarityMetricEditDistance class can be improved.

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Local optimization of the current edit-distance handling code in AsterixDB: Can we terminate the edit distance calculation based on a given edit-distance threshold T? I think the answer is yes.

      Basic Algorithm:
      For two strings X and Y,

      Init:
      Construct D(len(X)+1=M, len(Y)+1=N).
      D(i,0) = i
      D(j,0) = j

      Iteration:

      For each i = 1 to M
          For each j = 1 to N
              D(i,j) = min( D(i-1,j) + 1 // deletion case
                            D(i,j-1) + 1 // insertion case
                            D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), 
                                                       cost is 0 if X(i) = Y(j))
                          )
      

      Result:
      D(M,N) is the edit distance value.

      Example:

      Early Termination condition:
      So, for the given threshold T, we can stop the computation early if the values of three cases are greater than T. That is,

      min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T

      This holds since if the cost of all possible cases (insertion, deletion, and substitution) is greater than T, all future operations will be greater than T in any cases.

        Attachments

          Activity

            People

            • Assignee:
              wangsaeu Taewoo Kim
              Reporter:
              wangsaeu Taewoo Kim
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: