  1. Commons Text
  2. TEXT-155

Add a generic OverlapSimilarity measure


Details

    • New Feature
    • Status: Resolved
    • Minor
    • Resolution: Implemented
    • 1.6
    • 1.7
    • None

    Description

      The SimilarityScore<T> interface can be used to compute a generic result. I propose adding a class that computes the intersection between two sets formed from the characters of the inputs. Each set is built from a CharSequence argument of the apply method using a Function<CharSequence, Set<T>> that converts the CharSequence. This function is passed to the SimilarityScore<T> implementation during construction.

      The computed result then reports the size of each set and the size of their intersection.

      I have created an implementation that can compute the equivalent of the JaccardSimilarity class by creating a Set<Character>, and also the F1-score using bigrams (pairs of characters) by creating a Set<String>. This relates to TEXT-126, which suggested an algorithm for the Sorensen-Dice similarity, also known as the F1-score.

      Here is an example:

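      import java.util.HashSet;
      import java.util.Set;
      import java.util.function.Function;

      // IntersectionSimilarity and IntersectionResult are the classes proposed in this issue.

      // Match the functionality of the JaccardSimilarity class:
      // convert each CharSequence to the set of its distinct characters.
      Function<CharSequence, Set<Character>> converter = (cs) -> {
          final Set<Character> set = new HashSet<>();
          for (int i = 0; i < cs.length(); i++) {
              set.add(cs.charAt(i));
          }
          return set;
      };
      IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
      IntersectionResult result = similarity.apply("something", "something else");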
      

      The result has the size of set A, set B and the intersection between them.
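
To illustrate how those three sizes are enough to derive the usual scores, here is a minimal sketch using only the JDK. The class OverlapMetrics and its method names are hypothetical and not part of Commons Text; returning 0 for two empty sets is a convention chosen for this sketch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Commons Text. */
final class OverlapMetrics {

    /** Converts a CharSequence to the set of its distinct characters. */
    static Set<Character> toCharacterSet(CharSequence cs) {
        final Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Counts the elements common to both sets without modifying either. */
    static <T> int intersectionSize(Set<T> a, Set<T> b) {
        final Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    /** Jaccard index: |A n B| / |A u B|, with |A u B| = |A| + |B| - |A n B|. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        final int union = sizeA + sizeB - intersection;
        // Assumption for this sketch: two empty sets score 0.
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    /** Sorensen-Dice coefficient (F1-score): 2 |A n B| / (|A| + |B|). */
    static double sorensenDice(int sizeA, int sizeB, int intersection) {
        final int total = sizeA + sizeB;
        // Assumption for this sketch: two empty sets score 0.
        return total == 0 ? 0.0 : 2.0 * intersection / total;
    }
}
```

For "something" and "something else" the character sets have sizes 9 and 11 with an intersection of 9, so jaccard(9, 11, 9) gives 9/11 and sorensenDice(9, 11, 9) gives 0.9.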

      This class was inspired by my look through the various similarity implementations. All of them except CosineSimilarity perform single-character matching between the input CharSequences. CosineSimilarity tokenises on whitespace to create words.

      This more generic implementation allows the user to decide how to divide the CharSequence to create the sets that are compared, e.g. single characters, words, bigrams, etc.
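
As a sketch of what such user-supplied converters might look like, the following JDK-only example builds bigram and whitespace-delimited word sets. The class SetConverters and the constant names are hypothetical; only the Function<CharSequence, Set<T>> shape comes from the proposal above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical converters for illustration only; not part of Commons Text. */
final class SetConverters {

    /** Splits the input into overlapping pairs of adjacent characters. */
    static final Function<CharSequence, Set<String>> BIGRAMS = cs -> {
        final Set<String> set = new HashSet<>();
        // Inputs shorter than two characters produce an empty set.
        for (int i = 0; i + 1 < cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    };

    /** Splits the input into distinct words separated by whitespace. */
    static final Function<CharSequence, Set<String>> WORDS = cs -> {
        final Set<String> set = new HashSet<>();
        for (final String word : cs.toString().trim().split("\\s+")) {
            // A blank input splits to a single empty string; skip it.
            if (!word.isEmpty()) {
                set.add(word);
            }
        }
        return set;
    };
}
```

For example, BIGRAMS applied to "night" yields the four bigrams ni, ig, gh and ht, and applied to "nacht" yields na, ac, ch and ht, so the two sets share one element. Either converter could be passed to the proposed class in place of the single-character converter.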

            People

              aherbert Alex Herbert
              aherbert Alex Herbert
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5.5h