Commons Codec
  1. Commons Codec
  2. CODEC-57

Metaphone.metaphone(String) returns an empty string when passed the word "why".

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Labels:
      None
    • Environment:

      Commons-codec built from source using jdk 1.4.2.
      OS: Windows XP
      Java Build: 1.4.2

      Description

      An empty string is returned from the Metaphone.metaphone(String) method when passed the value "why". Variations on the value, such as "wwwhy" and "wwhhhy" also return empty strings.

      This appears to be an issue since other implementations of the metaphone algorithm, namely the PHP version, returns "H" when passed the value "why".

        Activity

        Hide
        Henri Yandell added a comment -

        Note - the Perl variant can run in original-mode.

        I've added a note to the javadoc about this issue.

        Show
        Henri Yandell added a comment - Note - the Perl variant can run in original-mode. I've added a note to the javadoc about this issue.
        Hide
        Henri Yandell added a comment -

        I'm marking this as WONTFIX. I don't think we should change our Metaphone from adhering to the original algorithm to adhere to PHP's modified algorithm, which looks like it is based on the Perl Text-Metaphone-1.96 algorithm.

        http://search.cpan.org/~mschwern/Text-Metaphone-1.96/Metaphone.pm

        That version states that it changes the algorithm; so either this change is from the Perl original, or added later to the PHP clone. Looking at the PHP source, it looks like there are other differences. At least the 'CC'->'GG' one i mentioned above, but looks like others too.

        To fix this, we would need to add a TextMetaphone class that ports the changes made in the Perl original; but I think licensing will hurt us there. I don't think we can take PHP or Perl code and put it in Apache licensed code. Implementing by reverse engineering seems like a very tiring task, I know I don't have the energy for it

        Show
        Henri Yandell added a comment - I'm marking this as WONTFIX. I don't think we should change our Metaphone from adhering to the original algorithm to adhere to PHP's modified algorithm, which looks like it is based on the Perl Text-Metaphone-1.96 algorithm. http://search.cpan.org/~mschwern/Text-Metaphone-1.96/Metaphone.pm That version states that it changes the algorithm; so either this change is from the Perl original, or added later to the PHP clone. Looking at the PHP source, it looks like there are other differences. At least the 'CC'->'GG' one i mentioned above, but looks like others too. To fix this, we would need to add a TextMetaphone class that ports the changes made in the Perl original; but I think licensing will hurt us there. I don't think we can take PHP or Perl code and put it in Apache licensed code. Implementing by reverse engineering seems like a very tiring task, I know I don't have the energy for it
        Hide
        Henri Yandell added a comment -

        It's a bug in the algorithm.

        The problem with both fixes is in the other words they affect.

        So PHP makes WHY->H and affects things like WHAT being HT and not WT.
        My suggestion makes WHY->Y, and changes WHYTE from T to WT.

        PHP change all 'WH' from W to H. Mine only affects WH?Y and the result is to make the W more sticky. I think mine's going to affect a lot less words (so better for backwards compat) and also more correct as it creates better looking tokens, but I'm hardly an expert at any of this. It's entirely possible that an empty String is intended to be an acceptable Soundex.

        Or the best solution might be to, if the metaphone is an empty String; simply return the first character of the input.

        Show
        Henri Yandell added a comment - It's a bug in the algorithm. The problem with both fixes is in the other words they affect. So PHP makes WHY->H and affects things like WHAT being HT and not WT. My suggestion makes WHY->Y, and changes WHYTE from T to WT. PHP change all 'WH' from W to H. Mine only affects WH?Y and the result is to make the W more sticky. I think mine's going to affect a lot less words (so better for backwards compat) and also more correct as it creates better looking tokens, but I'm hardly an expert at any of this. It's entirely possible that an empty String is intended to be an acceptable Soundex. Or the best solution might be to, if the metaphone is an empty String; simply return the first character of the input.
        Hide
        ggregory@seagullsw.com added a comment -

        If there any consensus on which is "correct"? WH->W or WY->Y? Matching Python and Ruby is besides the point IMO. If WH->W is a bug then there is no sense replicating it

        Show
        ggregory@seagullsw.com added a comment - If there any consensus on which is "correct"? WH->W or WY->Y? Matching Python and Ruby is besides the point IMO. If WH->W is a bug then there is no sense replicating it
        Hide
        Henri Yandell added a comment -

        I think that unless we hear otherwise, we should not change the CC/GG bit. People's data needs to be likely to work, regardless of the better algorithm etc.

        With this specific bug though, having an empty string feels bad. Both the Ruby and Python implementations do WH->W. I don't feel a huge urge to make it WH->H just to match PHP. I'm more tempted to do the WY->Y.

        Any thoughts?

        Show
        Henri Yandell added a comment - I think that unless we hear otherwise, we should not change the CC/GG bit. People's data needs to be likely to work, regardless of the better algorithm etc. With this specific bug though, having an empty string feels bad. Both the Ruby and Python implementations do WH->W. I don't feel a huge urge to make it WH->H just to match PHP. I'm more tempted to do the WY->Y. Any thoughts?
        Hide
        Henri Yandell added a comment -

        Also... can't help but think that there's sod all testing in MetaphoneTest. We need to add some test cases.

        Show
        Henri Yandell added a comment - Also... can't help but think that there's sod all testing in MetaphoneTest. We need to add some test cases.
        Hide
        Henri Yandell added a comment -

        Looks like we do the MB->M CODEC-17.

        Show
        Henri Yandell added a comment - Looks like we do the MB->M CODEC-17 .
        Hide
        Henri Yandell added a comment -

        Incidentally, those other bugs are that 'CC' should be considered as not allowed duplication and that MB->M should only happen at the end of a word. An unmentioned change is that they've also made 'GG' an allowed duplication.

        Show
        Henri Yandell added a comment - Incidentally, those other bugs are that 'CC' should be considered as not allowed duplication and that MB->M should only happen at the end of a word. An unmentioned change is that they've also made 'GG' an allowed duplication.
        Hide
        Henri Yandell added a comment -

        Looking at the code; the first step it does is turn the WH into a W.

        Then later on, both W and Y are silent if they are not followed by a vowel.

        Playing with the link above, it looks like WH is turned into H there. A quick look at the source code to PHP shows that it is indeed converted to H. Another quick look, this time at text.rubyforge shows that the Ruby version converts to W as we do [though it claims to compare with the PHP version for differences].

        Looking at DoubleMetaphone, it handles WH differently. If ^WH, then it'll append an A.

        Looking at the original BASIC code [as posted by aspell.sf.net]:

        IF TWO = "WH" THEN ENAME = "W":ENAME[3,9999]

        So it looks like PHP are the one with the bigger bug - a surname of WHYE should be YE and not HE. Then it seems that Metaphone itself is weak in that (my opinion) it should consider 'Y' a vowel when looking after 'W' for a vowel.

        I'm not sure what we should do though. The documentation at http://text.rubyforge.org/svn/lib/text/metaphone.rb also indicates that there are other bugs in the original BASIC compared to the original discussion (anyone got that magazine article? ). So this might just be a bug in the BASIC implementation rather than the original algorithm.

        Show
        Henri Yandell added a comment - Looking at the code; the first step it does is turn the WH into a W. Then later on, both W and Y are silent if they are not followed by a vowel. Playing with the link above, it looks like WH is turned into H there. A quick look at the source code to PHP shows that it is indeed converted to H. Another quick look, this time at text.rubyforge shows that the Ruby version converts to W as we do [though it claims to compare with the PHP version for differences] . Looking at DoubleMetaphone, it handles WH differently. If ^WH, then it'll append an A. Looking at the original BASIC code [as posted by aspell.sf.net] : IF TWO = "WH" THEN ENAME = "W":ENAME [3,9999] So it looks like PHP are the one with the bigger bug - a surname of WHYE should be YE and not HE. Then it seems that Metaphone itself is weak in that (my opinion) it should consider 'Y' a vowel when looking after 'W' for a vowel. I'm not sure what we should do though. The documentation at http://text.rubyforge.org/svn/lib/text/metaphone.rb also indicates that there are other bugs in the original BASIC compared to the original discussion (anyone got that magazine article? ). So this might just be a bug in the BASIC implementation rather than the original algorithm.
        Hide
        ggregory@seagullsw.com added a comment -

        I have a unit test that confirms this (committed). I can see the "H" in "WHY" also here: http://www.searchforancestors.com/utility/metaphone.php (picked from Google search results)

        Show
        ggregory@seagullsw.com added a comment - I have a unit test that confirms this (committed). I can see the "H" in "WHY" also here: http://www.searchforancestors.com/utility/metaphone.php (picked from Google search results)

          People

          • Assignee:
            Unassigned
            Reporter:
            Adam Wilmore
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development