Issue Details (XML | Word | Printable)

Key: CODEC-84
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Unassigned
Reporter: Niall Pemberton
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Commons Codec

Double Metaphone bugs in alternative encoding

Created: 02/Aug/09 12:09 AM   Updated: 07/Aug/09 08:10 AM
Return to search
Component/s: None
Affects Version/s: 1.3
Fix Version/s: 1.4

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works CODEC-84-DoubleMetaphone-Alternate-bugs.patch 2009-08-02 12:22 AM Niall Pemberton 4 kB

Resolution Date: 03/Aug/09 12:38 AM


 Description  « Hide
The new test case (CODEC-83) has highlighted a number of issues with the "alternative" encoding in the Double Metaphone implementation

1) Bug in the handleG method when "G" is followed by "IER"

  • The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
  • The alternative encoding of "rogier" results in "RKR" rather than "RJR"

The problem is in the handleG() method and is caused by the wrong length (4 instead of 3) being used in the contains() method:

} else if (contains(value, index + 1, 4, "IER")) {

...this should be

} else if (contains(value, index + 1, 3, "IER")) {

2) Bug in the handleL method

  • The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"

The problem is that the first thing this method does is append an "L" to both primary & alternative encoding. When the conditionL0() method returns true then the "L" should not be appended for the alternative encoding

result.append('L');
if (charAt(value, index + 1) == 'L') {
    if (conditionL0(value, index)) {
        result.appendAlternate(' ');
    }
    index += 2;
} else {
    index++;
}
return index;

Suggest refeactoring this to

if (charAt(value, index + 1) == 'L') {
    if (conditionL0(value, index)) {
        result.appendPrimary('L');
    } else {
        result.append('L');
    }
    index += 2;
} else {
    result.append('L');
    index++;
}
return index;

3) Bug in the conditionL0() method for words ending in "AS" and "OS"

  • The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"

The problem is caused by the wrong start position being used in the contains() method, which means its not checking the last two characters of the word but checks the previous & current position instead:

} else if ((contains(value, index - 1, 2, "AS", "OS") ||

...this should be

} else if ((contains(value, value.length() - 2, 2, "AS", "OS") ||

I'll attach a patch for review



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Niall Pemberton made changes - 02/Aug/09 12:21 AM
Field Original Value New Value
Description The new test case (CODEC-83) has highlighted a number of issues with the "alternative" encoding in the Double Metaphone implementation

1) Bug in the handleG method when "G" is followed by "IER"
 * The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
 * The alternative encoding of "rogier" results in "RKR" rather than "RJR"

The problem is in the handleG() method and is caused by the wrong length (4 instead of 3) being used in the contains() method:

{code}
 } else if (contains(value, index + 1, 4, "IER")) {
{code}

...this should be

{code}
 } else if (contains(value, index + 1, 3, "IER")) {
{code}


2) Bug in the handleL method
 * The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"

The problem is that the first thing this method does is append an "L" to both primary & alternative encoding. When the conditionL0() method returns true then the "L" should not be appended for the alternative encoding

{code}
        result.append('L');
        if (charAt(value, index + 1) == 'L') {
            if (conditionL0(value, index)) {
                result.appendAlternate(' ');
            }
            index += 2;
        } else {
            index++;
        }
        return index;
{code}

Suggest refeactoring this to

{code}
        if (charAt(value, index + 1) == 'L') {
            if (conditionL0(value, index)) {
                result.appendPrimary('L');
            } else {
                result.append('L');
            }
            index += 2;
        } else {
            result.append('L');
            index++;
        }
        return index;
{code}

3) Bug in the conditionL0() method for words ending in "AS" and "OS"
 * The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"

The problem is caused by the wrong start position being used in the contains() method, which means its not checking the last two characters of the word but checks the previous & current position instead:

{code}
        } else if ((contains(value, index - 1, 2, "AS", "OS") ||
{code}

...this should be

{code}
        } else if ((contains(value, value.length() - 2, 2, "AS", "OS") ||
{code}

I'll attach a patch for review
The new test case (CODEC-83) has highlighted a number of issues with the "alternative" encoding in the Double Metaphone implementation

1) Bug in the handleG method when "G" is followed by "IER"
 * The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
 * The alternative encoding of "rogier" results in "RKR" rather than "RJR"

The problem is in the handleG() method and is caused by the wrong length (4 instead of 3) being used in the contains() method:

{code}
 } else if (contains(value, index + 1, 4, "IER")) {
{code}

...this should be

{code}
 } else if (contains(value, index + 1, 3, "IER")) {
{code}


2) Bug in the handleL method
 * The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"

The problem is that the first thing this method does is append an "L" to both primary & alternative encoding. When the conditionL0() method returns true then the "L" should not be appended for the alternative encoding

{code}
result.append('L');
if (charAt(value, index + 1) == 'L') {
    if (conditionL0(value, index)) {
        result.appendAlternate(' ');
    }
    index += 2;
} else {
    index++;
}
return index;
{code}

Suggest refeactoring this to

{code}
if (charAt(value, index + 1) == 'L') {
    if (conditionL0(value, index)) {
        result.appendPrimary('L');
    } else {
        result.append('L');
    }
    index += 2;
} else {
    result.append('L');
    index++;
}
return index;
{code}

3) Bug in the conditionL0() method for words ending in "AS" and "OS"
 * The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"

The problem is caused by the wrong start position being used in the contains() method, which means its not checking the last two characters of the word but checks the previous & current position instead:

{code}
        } else if ((contains(value, index - 1, 2, "AS", "OS") ||
{code}

...this should be

{code}
        } else if ((contains(value, value.length() - 2, 2, "AS", "OS") ||
{code}

I'll attach a patch for review
Niall Pemberton made changes - 02/Aug/09 12:22 AM
Repository Revision Date User Message
ASF #800153 Sun Aug 02 22:45:30 UTC 2009 ggregory [CODEC-84] Double Metaphone bugs in alternative encoding. Apply patch. Thank you Niall.
Files Changed
MODIFY /commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java
MODIFY /commons/proper/codec/trunk/src/test/org/apache/commons/codec/language/DoubleMetaphone2Test.java

Repository Revision Date User Message
ASF #800163 Sun Aug 02 23:51:03 UTC 2009 niallp Update release notes for CODEC-84
Files Changed
MODIFY /commons/proper/codec/trunk/RELEASE-NOTES.txt
MODIFY /commons/proper/codec/trunk/xdocs/changes.xml

Niall Pemberton made changes - 03/Aug/09 12:38 AM
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Niall Pemberton made changes - 07/Aug/09 08:10 AM
Status Resolved [ 5 ] Closed [ 6 ]