Lucene - Core / LUCENE-740

Bugs in contrib/snowball/.../SnowballProgram.java -> Kraaij-Pohlmann gives Index-OOB Exception

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.9
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      linux amd64

    • Lucene Fields:
      New, Patch Available

      Description

      (copied from mail to java-user)
      While playing with the various stemmers of Lucene (1.9.1), I got an
      index out of bounds exception:

      lucene-1.9.1>java -cp build/contrib/snowball/lucene-snowball-1.9.2-dev.jar net.sf.snowball.TestApp Kp bla.txt
      Exception in thread "main" java.lang.reflect.InvocationTargetException
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:615)
      at net.sf.snowball.TestApp.main(TestApp.java:56)
      Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 11
      at java.lang.StringBuffer.charAt(StringBuffer.java:303)
      at net.sf.snowball.SnowballProgram.find_among_b(SnowballProgram.java:270)
      at net.sf.snowball.ext.KpStemmer.r_Step_4(KpStemmer.java:1122)
      at net.sf.snowball.ext.KpStemmer.stem(KpStemmer.java:1997)

      This happens when executing the command above, where bla.txt contains
      just this one word: 'spijsvertering'.

      After some debugging, and some tests with the original snowball
      distribution from snowball.tartarus.org, it seems that the attached
      change is needed to avoid the exception.
      (The change comes from tartarus' SnowballProgram.java)
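The stack trace ends in StringBuffer.charAt, called from a backward suffix match. As a minimal sketch of how that kind of failure can arise, assume the match position has drifted out of sync with the buffer's length. The class and method names below are hypothetical, and the values are chosen only to mirror the reported index. This is not the SnowballProgram code and not the attached change, which comes from tartarus' SnowballProgram.java.

```java
/** Hypothetical illustration only; not taken from SnowballProgram or the patch. */
final class BackwardMatchSketch {
    private BackwardMatchSketch() {}

    /**
     * Compares suffix against the characters of current that end just
     * before cursor, walking backwards. It trusts cursor blindly, so a
     * stale cursor makes charAt throw.
     */
    static boolean endsWithUnchecked(StringBuffer current, int cursor, String suffix) {
        for (int i = 0; i < suffix.length(); i++) {
            char expected = suffix.charAt(suffix.length() - 1 - i);
            // charAt throws when cursor - 1 - i is outside [0, current.length())
            if (current.charAt(cursor - 1 - i) != expected) {
                return false;
            }
        }
        return true;
    }

    /** Same comparison, but rejects positions that fall outside the buffer. */
    static boolean endsWithChecked(StringBuffer current, int cursor, String suffix) {
        if (cursor > current.length() || cursor - suffix.length() < 0) {
            return false;
        }
        return endsWithUnchecked(current, cursor, suffix);
    }
}
```

With a buffer holding "spijsverter" (11 characters), a cursor of 12 makes the unchecked variant read index 11 and throw, while the checked variant simply reports no match. A bounds check like this only hides the symptom. Keeping the position bookkeeping consistent with the buffer is the more robust remedy.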

      1. 740-license.txt
        2 kB
        Steven Parkes
      2. lucene-1.9.1-SnowballProgram.java
        1 kB
        Andreas Kohn
      3. snowball.patch.txt
        296 kB
        Doron Cohen

          Activity

          Andreas Kohn added a comment -

          The patch is based on SnowballProgram.java as found on snowball.tartarus.org, so their licensing applies.

          Yonik Seeley added a comment -

          Speaking of licensing, that should probably be cleaned up.

          Doron Cohen added a comment -

          In addition to the SnowballProgram bug fix, there are a few updates on snowball.tartarus.org compared to the Snowball stemmers in Lucene, and a Hungarian stemmer was added. Any reason not to update all the stemmers with this fix?

          Otis Gospodnetic added a comment -

          +1 for latest and greatest.

          Doron Cohen added a comment -

          Updated + new stemmers and SnowballProgram fix from http://snowball.tartarus.org

          Doron Cohen added a comment -

          Attached "snowball.patch.txt" has "latest and greatest" plus a new test case in TestSnowball that demonstrates this Kp stemmer bug.

          Lucene tests and contrib/snowball tests pass.

          Doron Cohen added a comment -

          Two comments:

          1. Testing: There's only limited testing in Lucene's contrib for these stemmers - we could probably add a simple test for each stemmer.

          2. Licensing: when attaching the patch I granted it for ASF inclusion. But this only covers my (minimal) changes to this code. Stemmers themselves go under Snowball licensing - http://snowball.tartarus.org/license.php
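The "simple test for each stemmer" idea from point 1 could look roughly like the smoke test below. It is only a sketch: each stemmer is stood in for by a plain String-to-String function, and the class name, method name, and word list are hypothetical rather than part of Lucene's contrib/snowball API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical smoke test: every stemmer must survive every sample word. */
final class StemmerSmokeTestSketch {
    private StemmerSmokeTestSketch() {}

    /**
     * Runs each stemmer (modelled here as a String -> String function)
     * over each word and collects a description of every failure, so a
     * single bad word does not hide problems in the other stemmers.
     */
    static List<String> findFailures(Map<String, UnaryOperator<String>> stemmers,
                                     List<String> words) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, UnaryOperator<String>> entry : stemmers.entrySet()) {
            for (String word : words) {
                try {
                    String stem = entry.getValue().apply(word);
                    if (stem == null || stem.isEmpty()) {
                        failures.add(entry.getKey() + ": empty stem for " + word);
                    }
                } catch (RuntimeException ex) {
                    failures.add(entry.getKey() + ": "
                            + ex.getClass().getSimpleName() + " for " + word);
                }
            }
        }
        return failures;
    }
}
```

In a real test, each function would wrap one of the generated stemmer classes, and the word list would include known problem inputs such as 'spijsvertering'.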

          Steven Parkes added a comment -

          I'm kind of wondering about the snowball licensing, so I'm intrigued by Yonik's comment. Cleanup is necessary?

          Did the original snowball authors agree to license the software under the AL2.0? That's what LICENSE.txt says now. The source site cites the BSD license and says you can't claim it's licensed under another license.

          Doug Cutting added a comment -

          This is a good question. We redistribute stuff generated from Snowball sources, not the original files. Does this constitute a "redistribution in binary form"?

          I think the LICENSE.txt here refers to the code that's included in this sub-tree, which is Apache-licensed. So that's okay. If anything we might need to add something to NOTICE.txt and/or include a copy of Snowball's BSD license too, as something like SNOWBALL-LICENSE.txt.

          Steven Parkes added a comment -

          I don't see that "redistribution in binary form" makes any difference as far as the BSD license is concerned. The only difference between source and binary by BSD is the condition that the license terms be included in the docs as opposed to the sources.

          It looks like an explicit ASF policy on third-party inclusion is in the works: http://people.apache.org/~cliffs/3party.html but at this point it's only a proposal.

          If that, or something close to it, becomes policy, it doesn't look like the snowball stuff poses any problem: the BSD is a Category A (good) license.

          At some point it looks like the policy will require highlighting the fact that inclusion of the snowball stuff makes the affected distributions "multi-licensed", but that doesn't look terribly onerous.

          I've added a patch with a copy of the BSD license suitably modified (they only reference the BSD license in the snowball materials) and I've added a few lines to NOTICE.txt as seems to be required: http://www.apache.org/licenses/example-NOTICE.txt

          Steven Parkes added a comment -

          Do we want to consider this a candidate for 2.2? In any case, the license files in the patch could be applied, since 2.2 seems to be catching lots of those.

          Michael Busch added a comment -

          I think it makes sense to apply the license patch for 2.2.

          I will commit it today in case there are no objections.

          Michael Busch added a comment -

          I committed the license patch. We should probably add SNOWBALL-LICENSE.TXT
          to the META-INF dir of the snowball jar after LUCENE-908 is committed and the
          manifests are customizable.

          Thanks for the patch, Steven!

          Karl Wettin added a comment -

          Duplicate, see LUCENE-1142


            People

            • Assignee:
              Unassigned
            • Reporter:
              Andreas Kohn
            • Votes:
              0
            • Watchers:
              0
