  1. Commons Text
  2. TEXT-131

JaroWinklerDistance: Calculation deviates from definition


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.5
    • None

    Description

      The calculation in JaroWinklerDistance deviates from the definition of the Jaro-Winkler Similarity. By definition, the common prefix length is only determined for the first 4 characters. Further, the Jaro-Winkler Similarity is defined as JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity).
      Therefore, I recommend the following changes:

      1. Update Jaro-Winkler Similarity calculation
        final double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);

        to

        final double jw = j < 0.7D ? j : j + defaultScalingFactor * mtp[2] * (1D - j);

      2. Update calculation of Common Prefix Length
        for (int mi = 0; mi < min.length(); mi++) {

        to

        for (int mi = 0; mi < Math.min(4, min.length()); mi++) {

      3. Remove unnecessary return value
        return new int[] {matches, transpositions, prefix, max.length()};

        to

        return new int[] {matches, transpositions, prefix};
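
      For reference, the definition cited above can be sketched as a small standalone class. This is only an illustration and not the Commons Text implementation: the class name JaroWinklerSketch, its method names, and its constants are hypothetical. The scaling factor of 0.1 is the conventional default and is an assumption here, the 0.7 threshold is taken from the snippets above, and the handling of empty input is a choice made for this sketch.

```java
final class JaroWinklerSketch {

    // Assumed values: 0.1 is the conventional scaling factor; 4 and 0.7 come from the report above.
    private static final double SCALING_FACTOR = 0.1;
    private static final int MAX_PREFIX_LENGTH = 4;
    private static final double BOOST_THRESHOLD = 0.7;

    /** Jaro similarity: (m/|a| + m/|b| + (m - t)/m) / 3. */
    static double jaro(CharSequence a, CharSequence b) {
        int la = a.length();
        int lb = b.length();
        if (la == 0 && lb == 0) {
            return 1.0; // sketch choice: two empty inputs count as identical
        }
        if (la == 0 || lb == 0) {
            return 0.0;
        }
        // Characters match only if they are equal and at most `window` positions apart.
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] matchedA = new boolean[la];
        boolean[] matchedB = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lb - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Walk the matched characters of both inputs in order and count positions that disagree.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < la; i++) {
            if (!matchedA[i]) {
                continue;
            }
            while (!matchedB[k]) {
                k++;
            }
            if (a.charAt(i) != b.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double t = outOfOrder / 2.0; // transpositions are half the out-of-order matches
        return (m / la + m / lb + (m - t) / m) / 3.0;
    }

    /** Length of the common prefix, capped at the first 4 characters. */
    static int commonPrefixLength(CharSequence a, CharSequence b) {
        int limit = Math.min(MAX_PREFIX_LENGTH, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return prefix;
    }

    /** Jaro-Winkler similarity: j + scalingFactor * prefixLength * (1 - j). */
    static double similarity(CharSequence a, CharSequence b) {
        double j = jaro(a, b);
        if (j < BOOST_THRESHOLD) {
            return j; // no prefix boost below the threshold, as in the snippets above
        }
        return j + SCALING_FACTOR * commonPrefixLength(a, b) * (1.0 - j);
    }
}
```

      With these assumptions, JaroWinklerSketch.similarity("MARTHA", "MARHTA") evaluates to roughly 0.9611 (Jaro similarity 0.9444, common prefix length 3). Because the prefix is capped at 4 and the scaling factor is 0.1, the boost term can never push the result above 1, which is why the Math.min(defaultScalingFactor, 1D / mtp[3]) guard is not needed.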
        

          People

            Assignee: chtompki Rob Tompkins
            Reporter: jmkeil Jan Martin Keil