Uploaded image for project: 'Commons Text'
  1. Commons Text
  2. TEXT-157

Remove rounding from JaccardSimilarity and Distance to improve ranking

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Trivial
    • Resolution: Implemented
    • 1.6
    • 1.7
    • None

    Description

      The JaccardSimilarity uses rounding to 2 decimal places. This prevents ranking of dissimilar sequences of even moderately short length.

      Using sequences with 1 or 2 characters in common and the remaining characters are different:

       2 0.500000 1.000000 : aa vs (ab or aa)
       3 0.250000 0.330000 : aaD vs (abd or aaÀ)
       4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
       5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
       6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
       7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
       8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
       9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
      10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
      

      Without rounding the scores are different where previously rounding had produced the same score. This will improve ranking:

       2 0.500000 1.000000 : aa vs (ab or aa)
       3 0.250000 0.333333 : aaD vs (abd or aaÀ)
       4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
       5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
       6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
       7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
       8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
       9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
      10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
      

       Generated using:

      @Test
      public void roundingDemo() {
          // First character of each dissimilar sequence.
          // Chosen for a nice output where we already know the loop
          // will exit before sequence overlap.
          char ch1 = 'D';
          char ch2 = 'd';
          char ch3 = 0x00c0;
          // 1 or 2 characters in common
          StringBuilder sb1 = new StringBuilder("aa");
          StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
          StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
          JaccardSimilarity similarity = new JaccardSimilarity();
          // Extend the sequences until a single/double character 
          // similarity cannot be detected
          double j1, j2;
          do  {
              j1 = similarity.apply(sb1, sb2);
              j2 = similarity.apply(sb1, sb3);
              System.out.printf("%2d %f %f : %s vs (%s or %s)%n", 
                                sb1.length(), j1, j2, sb1, sb2, sb3);
              // Extend the sequence using unique characters for each
              sb1.append(ch1++);
              sb2.append(ch2++);
              sb3.append(ch3++);
      
              // Note: Check length since the sequences will overlap
              // in case the rounding is not present
          } while (j1 != j2 && sb1.length() < 26); 
      }
      
      

      Attachments

        Issue Links

          Activity

            People

              aherbert Alex Herbert
              aherbert Alex Herbert
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 20m
                  1h 20m