Lucene - Core
  1. Lucene - Core
  2. LUCENE-6586

There is a typo in GermanStemmer that can lead to wrong stemming

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 5.2.1
    • Fix Version/s: 5.3, 6.0
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      There is a small typo in GermanStemmer that leads to a wrong calclulation of the substCount in line 203:

      substCount =+ 2;

      should be

      substCount += 2;

      I created a Pull Request for this some time ago, but it was apprently overlooked:
      https://github.com/apache/lucene-solr/pull/141

        Activity

        Hide
        Erick Erickson added a comment -

        Thanks for reporting this! Yeah, Git is there for convenience, JIRAs are how we
        track code changes, the "usual" method of drawing attention to a code problem
        is to attach a patch to a JIRA.

        I noticed that the "How To Contribute" page really doesn't mention this, I'll see
        if I can update it.

        Thanks for both noticing the original problem and your persistence bringing it
        to our attention!

        Show
        Erick Erickson added a comment - Thanks for reporting this! Yeah, Git is there for convenience, JIRAs are how we track code changes, the "usual" method of drawing attention to a code problem is to attach a patch to a JIRA. I noticed that the "How To Contribute" page really doesn't mention this, I'll see if I can update it. Thanks for both noticing the original problem and your persistence bringing it to our attention!
        Hide
        Michael McCandless added a comment -

        Is there a small test case that exposes this bug? I'd like to fix this, and it certainly looks like a silly bug, but I don't fully understand GermStemmer so it would be nice to shine a light on the bug first with a test case.

        Show
        Michael McCandless added a comment - Is there a small test case that exposes this bug? I'd like to fix this, and it certainly looks like a silly bug, but I don't fully understand GermStemmer so it would be nice to shine a light on the bug first with a test case.
        Hide
        Christoph Kaser added a comment -

        Hi Michael,

        I tried to write a small test case and realized that there is no input that leads to a wrong token.
        substCount is only used to decide how large the original input was, because some suffixes are only stripped if the token has a minimum length.

        if ( ( buffer.length() + substCount > 5 ) &&
              buffer.substring( buffer.length() - 2, buffer.length() ).equals( "nd" ) )
            {
              buffer.delete( buffer.length() - 2, buffer.length() );
            }
        

        However, every substitution leaves at least one character. For the bug to take effect, there has to be a substitution before the one that sets substCount to 2 (instead of incrementing it by 2).
        So we have

        • 2 characters that where left by the (at least 2) substitutions
        • the suffix "nd"
        • substCount, which was set to 2
          That sums up to 6 , which is greater than 5

        The other conditions that check on substCount work the same, except they check for greater than 4.

        Therefore, there is no token that triggers any wrong behaviour.

        Still, I think the typo should be fixed, because it might be copied to a place where it has an effect.

        Show
        Christoph Kaser added a comment - Hi Michael, I tried to write a small test case and realized that there is no input that leads to a wrong token. substCount is only used to decide how large the original input was, because some suffixes are only stripped if the token has a minimum length. if ( ( buffer.length() + substCount > 5 ) && buffer.substring( buffer.length() - 2, buffer.length() ).equals( "nd" ) ) { buffer.delete( buffer.length() - 2, buffer.length() ); } However, every substitution leaves at least one character. For the bug to take effect, there has to be a substitution before the one that sets substCount to 2 (instead of incrementing it by 2). So we have 2 characters that where left by the (at least 2) substitutions the suffix "nd" substCount, which was set to 2 That sums up to 6 , which is greater than 5 The other conditions that check on substCount work the same, except they check for greater than 4. Therefore, there is no token that triggers any wrong behaviour. Still, I think the typo should be fixed, because it might be copied to a place where it has an effect.
        Hide
        Michael McCandless added a comment -

        OK, thanks for exploring for a possible test case Christoph Kaser, I'll commit your fix as is!

        Show
        Michael McCandless added a comment - OK, thanks for exploring for a possible test case Christoph Kaser , I'll commit your fix as is!
        Hide
        ASF subversion and git services added a comment -

        Commit 1687950 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1687950 ]

        LUCENE-6586: fix typo causing possible wrong value for substCount

        Show
        ASF subversion and git services added a comment - Commit 1687950 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1687950 ] LUCENE-6586 : fix typo causing possible wrong value for substCount
        Hide
        ASF subversion and git services added a comment -

        Commit 1687951 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1687951 ]

        LUCENE-6586: fix typo causing possible wrong value for substCount

        Show
        ASF subversion and git services added a comment - Commit 1687951 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1687951 ] LUCENE-6586 : fix typo causing possible wrong value for substCount
        Hide
        Michael McCandless added a comment -
        Show
        Michael McCandless added a comment - Thanks Christoph Kaser !
        Hide
        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release

        Show
        Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

          People

          • Assignee:
            Unassigned
            Reporter:
            Christoph Kaser
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development