Commons Codec / CODEC-84

Double Metaphone bugs in alternative encoding


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • None

    Description

      The new test case (CODEC-83) has highlighted a number of issues with the "alternative" encoding in the Double Metaphone implementation:

      1) Bug in the handleG method when "G" is followed by "IER"

      • The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
      • The alternative encoding of "rogier" results in "RKR" rather than "RJR"

      The problem is in the handleG() method and is caused by the wrong length (4 instead of 3) being used in the contains() method:

       } else if (contains(value, index + 1, 4, "IER")) {
      

      ...this should be

       } else if (contains(value, index + 1, 3, "IER")) {
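
      For illustration, here is a minimal sketch of why the length matters. It assumes contains() extracts a window of exactly `length` characters and compares it for equality with each candidate; the helper below is a simplified stand-in written for this example, not the library source, and the class name is hypothetical.

```java
// Simplified stand-in for the contains() helper (an assumption, not the library source).
final class ContainsLengthDemo {

    // Returns true when the window [start, start + length) equals one of the criteria.
    static boolean contains(String value, int start, int length, String... criteria) {
        if (start < 0 || start + length > value.length()) {
            return false; // window falls outside the word
        }
        String target = value.substring(start, start + length);
        for (String candidate : criteria) {
            if (target.equals(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String value = "ANGIER";
        int index = value.indexOf('G'); // position of the "G" being handled

        // A 4-character window can never equal the 3-character candidate "IER".
        System.out.println(contains(value, index + 1, 4, "IER")); // false

        // A 3-character window starting after the "G" is exactly "IER".
        System.out.println(contains(value, index + 1, 3, "IER")); // true
    }
}
```

      Under that assumption, the length-4 check can never succeed for a three-letter candidate, so the "IER" branch is effectively dead code until the length is corrected.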
      

      2) Bug in the handleL method

      • The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"

      The problem is that the first thing this method does is append an "L" to both the primary and alternative encodings. When the conditionL0() method returns true, the "L" should not be appended to the alternative encoding.

      result.append('L');
      if (charAt(value, index + 1) == 'L') {
          if (conditionL0(value, index)) {
              result.appendAlternate(' ');
          }
          index += 2;
      } else {
          index++;
      }
      return index;
      

      Suggest refactoring this to:

      if (charAt(value, index + 1) == 'L') {
          if (conditionL0(value, index)) {
              result.appendPrimary('L');
          } else {
              result.append('L');
          }
          index += 2;
      } else {
          result.append('L');
          index++;
      }
      return index;
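
      To make the difference concrete, here is a rough side-by-side sketch of the two versions. The Result holder, the class name, and passing the outcome of conditionL0() in as a boolean are all simplifications assumed for this example; they are not the library's actual types or signatures.

```java
final class HandleLDemo {

    // Hypothetical two-track result holder: one buffer per encoding.
    static final class Result {
        final StringBuilder primary = new StringBuilder();
        final StringBuilder alternate = new StringBuilder();

        void append(char c) { primary.append(c); alternate.append(c); }
        void appendPrimary(char c) { primary.append(c); }
        void appendAlternate(char c) { alternate.append(c); }
    }

    // Original logic: "L" is appended to both encodings before anything is checked.
    static int handleLOriginal(String value, Result result, int index, boolean conditionL0) {
        result.append('L');
        if (index + 1 < value.length() && value.charAt(index + 1) == 'L') {
            if (conditionL0) {
                result.appendAlternate(' ');
            }
            index += 2;
        } else {
            index++;
        }
        return index;
    }

    // Suggested logic: when conditionL0 holds, only the primary encoding gets the "L".
    static int handleLSuggested(String value, Result result, int index, boolean conditionL0) {
        if (index + 1 < value.length() && value.charAt(index + 1) == 'L') {
            if (conditionL0) {
                result.appendPrimary('L');
            } else {
                result.append('L');
            }
            index += 2;
        } else {
            result.append('L');
            index++;
        }
        return index;
    }

    public static void main(String[] args) {
        String value = "CABRILLO";
        int index = value.indexOf("LL"); // position of the first "L"

        Result before = new Result();
        handleLOriginal(value, before, index, true);
        System.out.println("[" + before.primary + "] [" + before.alternate + "]"); // [L] [L ]

        Result after = new Result();
        handleLSuggested(value, after, index, true);
        System.out.println("[" + after.primary + "] [" + after.alternate + "]"); // [L] []
    }
}
```

      The extra "L " contributed by the original version matches the trailing "L " seen in "KPRL " for "cabrillo"; both versions advance the index by the same amount.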
      

      3) Bug in the conditionL0() method for words ending in "AS" and "OS"

      • The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"

      The problem is caused by the wrong start position being used in the contains() method, which means it's not checking the last two characters of the word but checks the previous and current positions instead:

              } else if ((contains(value, index - 1, 2, "AS", "OS") || 
      

      ...this should be

              } else if ((contains(value, value.length() - 2, 2, "AS", "OS") || 
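
      As a quick illustration of which characters each start position actually inspects, the sketch below prints the two windows for "GALLEGOS". The window() helper and the class name are hypothetical and exist only for this example; the sketch covers just the start-position argument, not the rest of the conditionL0() expression.

```java
final class ConditionL0Demo {

    // Hypothetical helper: the characters a contains(value, start, length, ...) call would compare.
    static String window(String value, int start, int length) {
        if (start < 0 || start + length > value.length()) {
            return ""; // out of range, nothing to compare
        }
        return value.substring(start, start + length);
    }

    public static void main(String[] args) {
        String value = "GALLEGOS";
        int index = value.indexOf("LL"); // position of the first "L"

        // Original start position: previous and current characters.
        System.out.println(window(value, index - 1, 2)); // AL

        // Corrected start position: last two characters of the word.
        System.out.println(window(value, value.length() - 2, 2)); // OS
    }
}
```

      With the original start position the comparison against "AS" and "OS" sees "AL", so the word ending is never examined.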
      

      I'll attach a patch for review.

          People

            Assignee: Unassigned
            Reporter: Niall Pemberton (niallp)