Details
-
Improvement
-
Status: Closed
-
Trivial
-
Resolution: Implemented
-
1.6
-
None
Description
The JaccardSimilarity uses rounding to 2 decimal places. This prevents ranking of dissimilar sequences of even moderately short length.
Using sequences with 1 or 2 characters in common and the remaining characters are different:
2 0.500000 1.000000 : aa vs (ab or aa) 3 0.250000 0.330000 : aaD vs (abd or aaÀ) 4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ) 5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ) 6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ) 7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ) 8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ) 9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ) 10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
Without rounding the scores are different where previously rounding had produced the same score. This will improve ranking:
2 0.500000 1.000000 : aa vs (ab or aa) 3 0.250000 0.333333 : aaD vs (abd or aaÀ) 4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ) 5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ) 6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ) 7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ) 8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ) 9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ) 10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
Generated using:
@Test public void roundingDemo() { // First character of each dissimilar sequence. // Chosen for a nice output where we already know the loop // will exit before sequence overlap. char ch1 = 'D'; char ch2 = 'd'; char ch3 = 0x00c0; // 1 or 2 characters in common StringBuilder sb1 = new StringBuilder("aa"); StringBuilder sb2 = new StringBuilder("ab"); // 1 in common StringBuilder sb3 = new StringBuilder("aa"); // 2 in common JaccardSimilarity similarity = new JaccardSimilarity(); // Extend the sequences until a single/double character // similarity cannot be detected double j1, j2; do { j1 = similarity.apply(sb1, sb2); j2 = similarity.apply(sb1, sb3); System.out.printf("%2d %f %f : %s vs (%s or %s)%n", sb1.length(), j1, j2, sb1, sb2, sb3); // Extend the sequence using unique characters for each sb1.append(ch1++); sb2.append(ch2++); sb3.append(ch3++); // Note: Check length since the sequences will overlap // in case the rounding is not present } while (j1 != j2 && sb1.length() < 26); }
Attachments
Issue Links
- links to