Details
Bug
Status: Closed
Major
Resolution: Fixed
1.4
None
Description
The calculation in JaroWinklerDistance deviates from the definition of the Jaro-Winkler similarity. By definition, the common prefix length is determined only over the first 4 characters. Further, the Jaro-Winkler similarity is defined as JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity).
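To make the definition concrete, here is a minimal sketch that applies the formula to an already-computed Jaro similarity. The class name JaroWinklerFormula, its method, and its constants are hypothetical and not part of the library. The scaling factor of 0.1 is an assumption (the usual textbook value), and the 0.7 threshold is taken from the snippets quoted in this report.

```java
import java.util.Locale;

public final class JaroWinklerFormula {
    // Assumed value; the report only refers to "defaultScalingFactor".
    public static final double SCALING_FACTOR = 0.1;
    // By definition, at most the first 4 characters count towards the prefix.
    public static final int MAX_PREFIX = 4;
    // Threshold below which no prefix boost is applied (0.7 in the quoted code).
    public static final double BOOST_THRESHOLD = 0.7;

    /** Applies JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity). */
    public static double jaroWinkler(double jaro, int commonPrefixLength) {
        if (jaro < BOOST_THRESHOLD) {
            return jaro;
        }
        int prefix = Math.min(MAX_PREFIX, commonPrefixLength);
        return jaro + SCALING_FACTOR * prefix * (1.0 - jaro);
    }

    public static void main(String[] args) {
        // "MARTHA" vs "MARHTA": 6 matches, 1 transposition,
        // so Jaro = (6/6 + 6/6 + 5/6) / 3, and the common prefix "MAR" has length 3.
        double jaro = (1.0 + 1.0 + 5.0 / 6.0) / 3.0;
        System.out.println(String.format(Locale.ROOT, "%.4f", jaroWinkler(jaro, 3))); // 0.9611
    }
}
```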
Therefore, I recommend the following changes:
- Update Jaro-Winkler Similarity calculation
final double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
to
final double jw = j < 0.7D ? j : j + defaultScalingFactor * mtp[2] * (1D - j);
- Update calculation of Common Prefix Length
for (int mi = 0; mi < min.length(); mi++) {
to
for (int mi = 0; mi < Math.min(4, min.length()); mi++) {
- Remove unnecessary return value
return new int[] {matches, transpositions, prefix, max.length()};
to
return new int[] {matches, transpositions, prefix};
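The proposed prefix change can be sketched in isolation as follows. This is only an illustration under assumptions: the class PrefixCap and the method commonPrefixLength are hypothetical names, not the library's actual code, and the loop is a simplified stand-in for the one quoted above.

```java
public final class PrefixCap {
    // By definition, only the first 4 characters contribute to the prefix.
    private static final int MAX_PREFIX = 4;

    /** Counts leading characters shared by both inputs, stopping at MAX_PREFIX. */
    public static int commonPrefixLength(CharSequence first, CharSequence second) {
        int limit = Math.min(MAX_PREFIX, Math.min(first.length(), second.length()));
        int prefix = 0;
        while (prefix < limit && first.charAt(prefix) == second.charAt(prefix)) {
            prefix++;
        }
        return prefix;
    }

    public static void main(String[] args) {
        // The two words share a 10-character prefix, but only 4 characters are counted.
        System.out.println(commonPrefixLength("transposition", "transpositoin")); // 4
        System.out.println(commonPrefixLength("ab", "abc"));                      // 2
    }
}
```

With the prefix capped at 4, and assuming a scaling factor of 0.1, the boost term is at most 0.4 * (1 - j), so the result cannot exceed 1. That would explain why the Math.min(defaultScalingFactor, 1D / mtp[3]) guard and the extra max.length() return value are no longer needed.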