Commons Text / TEXT-158

Incorrect values for Jaccard similarity with empty strings

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6, 1.9
    • Fix Version/s: 1.10.0
    • None

    Description

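      In a discussion that was part of TEXT-126, it was pointed out that, for two empty strings, the Commons Text Jaccard similarity returns 0.0 and the Jaccard distance returns 1.0, while other libraries return the opposite values (similarity 1.0, distance 0.0). The following program compares the three implementations: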

      package br.eti.kinoshita.tests.text;
      
      import java.util.Collections;
      
      public class EditDistances {
      
          public static void main(String[] args) {
              System.out.println("Testing jaccard sim/dis with empty strings");
              System.out.println("---");
              org.simmetrics.metrics.Jaccard<String> j1 = new org.simmetrics.metrics.Jaccard<>();
              float s1 = j1.compare(Collections.emptySet(), Collections.emptySet());
              System.out.println("Simmetrics Jaccard similarity: " + s1);
              float d1 = j1.distance(Collections.emptySet(), Collections.emptySet());
              System.out.println("Simmetrics Jaccard distance: " + d1);
              
              System.out.println("---");
              
              info.debatty.java.stringsimilarity.Jaccard j2 = new info.debatty.java.stringsimilarity.Jaccard();
              double s2 = j2.similarity("", "");
              System.out.println("javastringsimilarity Jaccard similarity: " + s2);
              double d2 = j2.distance("", "");
              System.out.println("javastringsimilarity Jaccard distance: " + d2);
              
              System.out.println("---");
              
              org.apache.commons.text.similarity.JaccardSimilarity j3_1 = new org.apache.commons.text.similarity.JaccardSimilarity();
              double s3 = j3_1.apply("", "");
              System.out.println("commons-text Jaccard similarity: " + s3);
              org.apache.commons.text.similarity.JaccardDistance j3_2 = new org.apache.commons.text.similarity.JaccardDistance();
              double d3 = j3_2.apply("", "");
              System.out.println("commons-text Jaccard distance: " + d3);
          }
      }

      Produces:

      Testing jaccard sim/dis with empty strings
      ---
      Simmetrics Jaccard similarity: 1.0
      Simmetrics Jaccard distance: 0.0
      ---
      javastringsimilarity Jaccard similarity: 1.0
      javastringsimilarity Jaccard distance: 0.0
      ---
      commons-text Jaccard similarity: 0.0
      commons-text Jaccard distance: 1.0

      We need to confirm what the correct output is for similarity and distance with empty strings, and then either document why we return the current values or fix this as a bug in the next release.
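
      One possible resolution, shown only as a sketch: the Jaccard index |A ∩ B| / |A ∪ B| is 0/0 when both sets are empty, so an implementation has to pick a convention for that case. The hypothetical `JaccardSketch` class below is not the Commons Text code. It assumes inputs are compared as sets of characters and that two empty inputs count as identical (similarity 1.0), which is the convention the other two libraries appear to follow.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not the Commons Text implementation.
final class JaccardSketch {

    private JaccardSketch() { }

    // Collects the distinct chars (UTF-16 code units) of a sequence into a set.
    private static Set<Character> charSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    // Jaccard similarity |A ∩ B| / |A ∪ B| over character sets.
    static double similarity(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        // Assumed convention: two empty sets are identical, so avoid 0/0 and return 1.0.
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Jaccard distance, defined as the complement of the similarity.
    static double distance(CharSequence left, CharSequence right) {
        return 1.0 - similarity(left, right);
    }
}
```

      Under this convention, `JaccardSketch.similarity("", "")` yields 1.0 and `JaccardSketch.distance("", "")` yields 0.0, in line with the Simmetrics and javastringsimilarity output above. Comparing an empty string with a non-empty one still gives similarity 0.0, because the intersection is empty while the union is not.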

            People

              Assignee: Bruno P. Kinoshita (kinow)
              Reporter: Bruno P. Kinoshita (kinow)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h