Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: other
      Platform: Other

      Description

      September 10th 2003 contributionn from "Sergio Guzman-Lara" <guzman@cs.umass.edu>

      Original email:

      Hi all,

      I have ported the kstem stemmer to Java and incorporated it to
      Lucene. You can get the source code (Kstem.jar) from the following website:

      http://ciir.cs.umass.edu/downloads/

      Just click on "KStem Java Implementation" (you will need to register
      your e-mail, for free of course, with the CIIR --Center for Intelligent
      Information Retrieval, UMass – and get an access code).

      Content of Kstem.jar:

      java/org/apache/lucene/analysis/KStemData1.java
      java/org/apache/lucene/analysis/KStemData2.java
      java/org/apache/lucene/analysis/KStemData3.java
      java/org/apache/lucene/analysis/KStemData4.java
      java/org/apache/lucene/analysis/KStemData5.java
      java/org/apache/lucene/analysis/KStemData6.java
      java/org/apache/lucene/analysis/KStemData7.java
      java/org/apache/lucene/analysis/KStemData8.java
      java/org/apache/lucene/analysis/KStemFilter.java
      java/org/apache/lucene/analysis/KStemmer.java

      KStemData1.java, ..., KStemData8.java Contain several lists of words
      used by Kstem
      KStemmer.java Implements the Kstem algorithm
      KStemFilter.java Extends TokenFilter applying Kstem

      To compile

      unjar the file Kstem.jar to Lucene's "src" directory, and compile it
      there.

      What is Kstem?

      A stemmer designed by Bob Krovetz (for more information see
      http://ciir.cs.umass.edu/pubfiles/ir-35.pdf).

      Copyright issues

      This is open source. The actual license agreement is included at the
      top of every source file.

      Any comments/questions/suggestions are welcome,

      Sergio Guzman-Lara
      Senior Research Fellow
      CIIR UMass

      1. lucid_kstem.tgz
        650 kB
        Yonik Seeley
      2. LUCENE-152.patch
        388 kB
        Robert Muir
      3. LUCENE-152_optimization.patch
        0.9 kB
        Yonik Seeley
      4. LUCENE-152_optimization.patch
        2 kB
        Yonik Seeley
      5. LUCENE-152_alt.patch
        1 kB
        Robert Muir
      6. kstemTestData.zip
        54 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Grant Ingersoll added a comment -

          KStem would be nice to have in the contrib/Analysis package, but I don't see that UMass is donating the code to the ASF. It should be sufficient that people can go download and incorporate it quite easily into Lucene (I have done this in the past). Thus, I think it is fine to mark this as won't fix.

          Show
          Grant Ingersoll added a comment - KStem would be nice to have in the contrib/Analysis package, but I don't see that UMass is donating the code to the ASF. It should be sufficient that people can go download and incorporate it quite easily into Lucene (I have done this in the past). Thus, I think it is fine to mark this as won't fix.
          Hide
          Steve Rowe added a comment -

          If the original sources are BSD licensed, is a software grant required to incorporate the sources into the Lucene/Solr source tree?

          Show
          Steve Rowe added a comment - If the original sources are BSD licensed, is a software grant required to incorporate the sources into the Lucene/Solr source tree?
          Hide
          Mark Miller added a comment -

          The general rule is that if its a fair amount of code, and it was developed outside of the Apache system, we want a software grant - even if its Apache 2 licensed code.

          Show
          Mark Miller added a comment - The general rule is that if its a fair amount of code, and it was developed outside of the Apache system, we want a software grant - even if its Apache 2 licensed code.
          Hide
          Hoss Man added a comment -

          even if its Apache 2 licensed code.

          Uh... that may be a stretch.

          More specifically: compile time dependencies on compiled BSD libraries are fine, but actually incorporating and releasing code that is under a BSD license is something we're aren't suppose to do (last time i checked)

          Show
          Hoss Man added a comment - even if its Apache 2 licensed code. Uh... that may be a stretch. More specifically: compile time dependencies on compiled BSD libraries are fine, but actually incorporating and releasing code that is under a BSD license is something we're aren't suppose to do (last time i checked)
          Hide
          Mark Miller added a comment -

          Uh... that may be a stretch.

          It's what the incubator seems to recommend, and the side have err'd on in the past.

          http://incubator.apache.org/ip-clearance/index.html

          If it was developed outside of Apache, we don't really know it's IP history, and that's something we want to take seriously.

          Show
          Mark Miller added a comment - Uh... that may be a stretch. It's what the incubator seems to recommend, and the side have err'd on in the past. http://incubator.apache.org/ip-clearance/index.html If it was developed outside of Apache, we don't really know it's IP history, and that's something we want to take seriously.
          Hide
          Mark Miller added a comment -

          To extract a bit for clarity:

          This form is not for new projects. This is for projects and PMCs that have already been created and are receiving a code donation into an existing codebase. Any code that was developed outside of the ASF SVN repository and our public mailing lists must be processed like this, even if the external developer is already an ASF committer.

          Show
          Mark Miller added a comment - To extract a bit for clarity: This form is not for new projects. This is for projects and PMCs that have already been created and are receiving a code donation into an existing codebase. Any code that was developed outside of the ASF SVN repository and our public mailing lists must be processed like this, even if the external developer is already an ASF committer.
          Hide
          Mark Miller added a comment -

          More specifically: compile time dependencies on compiled BSD libraries are fine, but actually incorporating and releasing code that is under a BSD license is something we're aren't suppose to do (last time i checked)

          Code is fine to afaik:
          http://www.apache.org/legal/3party.html

          Show
          Mark Miller added a comment - More specifically: compile time dependencies on compiled BSD libraries are fine, but actually incorporating and releasing code that is under a BSD license is something we're aren't suppose to do (last time i checked) Code is fine to afaik: http://www.apache.org/legal/3party.html
          Hide
          Steve Rowe added a comment -

          Code is fine to afaik: http://www.apache.org/legal/3party.html

          My interpretation of this is that we can directly include the KStem source code in Lucene/Solr's source tree, and then modify it at will, since its license (BSD style) is in Category A (authorized licenses).

          Thoughts?

          Show
          Steve Rowe added a comment - Code is fine to afaik: http://www.apache.org/legal/3party.html My interpretation of this is that we can directly include the KStem source code in Lucene/Solr's source tree, and then modify it at will, since its license (BSD style) is in Category A (authorized licenses). Thoughts?
          Hide
          Michael McCandless added a comment -

          I think that's right.

          Show
          Michael McCandless added a comment - I think that's right.
          Hide
          Yonik Seeley added a comment -

          heh - I had heard enough times that the license wouldn't permit it that I never looked into it myself.
          http://markmail.org/message/zlett7y3dj76xa2f

          Anyway, I did a bunch of optimizations for Lucid's version way back when. It makes sense for those to be contributed back here... I'll see what I can do (but it might be delayed a week by everyone being busy at Lucene Revolution).

          Show
          Yonik Seeley added a comment - heh - I had heard enough times that the license wouldn't permit it that I never looked into it myself. http://markmail.org/message/zlett7y3dj76xa2f Anyway, I did a bunch of optimizations for Lucid's version way back when. It makes sense for those to be contributed back here... I'll see what I can do (but it might be delayed a week by everyone being busy at Lucene Revolution).
          Hide
          Steve Rowe added a comment -

          Reopening - license issues appear to have disappeared.

          Show
          Steve Rowe added a comment - Reopening - license issues appear to have disappeared.
          Hide
          Michael McCandless added a comment -

          Setting fix version... seems like there's alot of interest for this stemmer and Lucid's improvements to it.

          Show
          Michael McCandless added a comment - Setting fix version... seems like there's alot of interest for this stemmer and Lucid's improvements to it.
          Hide
          Jan Høydahl added a comment -

          +1

          Show
          Jan Høydahl added a comment - +1
          Hide
          Robert Zotter added a comment -

          +1

          Show
          Robert Zotter added a comment - +1
          Hide
          Yonik Seeley added a comment -

          OK folks, here's Lucid's optimized version of kstemmer. Changes by Lucid to the original kstemmer are being contributed under the ASL.

          This is not a patch, but simply a tarball of Lucid's version. Not sure what we want to do with some of the stuff (like the biggish test files).

          IIRC, there were two types of optimizations... one type was efficiency (i.e. using CharArrMap, directly using a char[] in the stemmer, etc). Other optimizations actually changed the logic and code paths though, which is one reason I tested it over a whole document to ensure it still matched the original.

          Show
          Yonik Seeley added a comment - OK folks, here's Lucid's optimized version of kstemmer. Changes by Lucid to the original kstemmer are being contributed under the ASL. This is not a patch, but simply a tarball of Lucid's version. Not sure what we want to do with some of the stuff (like the biggish test files). IIRC, there were two types of optimizations... one type was efficiency (i.e. using CharArrMap, directly using a char[] in the stemmer, etc). Other optimizations actually changed the logic and code paths though, which is one reason I tested it over a whole document to ensure it still matched the original.
          Hide
          Robert Muir added a comment -

          This is not a patch, but simply a tarball of Lucid's version. Not sure what we want to do with some of the stuff (like the biggish test files)

          I don't think biggish test files are a problem personally, we already have these for the snowball stemmers for example.

          Show
          Robert Muir added a comment - This is not a patch, but simply a tarball of Lucid's version. Not sure what we want to do with some of the stuff (like the biggish test files) I don't think biggish test files are a problem personally, we already have these for the snowball stemmers for example.
          Hide
          Ryan McKinley added a comment -

          I'm fine with the 1.2MB history_of_the_united_states.txt in the tests

          Show
          Ryan McKinley added a comment - I'm fine with the 1.2MB history_of_the_united_states.txt in the tests
          Hide
          Robert Muir added a comment -

          we can zip it anyway, the existing stemmer tests use zipped files for this exact purpose.
          zipped: all the test data is about 500KB

          our snowball test data currently in src/test is zipped 3.1MB... so I think 500kb is ok.

          Show
          Robert Muir added a comment - we can zip it anyway, the existing stemmer tests use zipped files for this exact purpose. zipped: all the test data is about 500KB our snowball test data currently in src/test is zipped 3.1MB... so I think 500kb is ok.
          Hide
          Robert Muir added a comment -

          patch file, testdata zip goes in src/test/org/apache/lucene/analysis/en

          The testdata was converted over to tab-separated to use VocabularyAssert. The big HistoryOfUnitedStates file was actually only used in the little benchmarker, not in tests.

          Show
          Robert Muir added a comment - patch file, testdata zip goes in src/test/org/apache/lucene/analysis/en The testdata was converted over to tab-separated to use VocabularyAssert. The big HistoryOfUnitedStates file was actually only used in the little benchmarker, not in tests.
          Hide
          Ryan McKinley added a comment -

          +1 Looks good!

          Should OpenStringBuilder and CharsRef be combined? If not, is OpenStringBuilder usful outside of analysis? Should it be in org.apache.lucene.util?

          Show
          Ryan McKinley added a comment - +1 Looks good! Should OpenStringBuilder and CharsRef be combined? If not, is OpenStringBuilder usful outside of analysis? Should it be in org.apache.lucene.util?
          Hide
          Robert Muir added a comment -

          Ryan: maybe, I thought of this too myself looking at the patch.

          Then again there are probably other kinds of refactoring improvements we could make... honestly I didn't dig deep enough into this one to see if it can be solved just by 'add Appendable interface to CharsRef' or to even think if thats the right thing to do.

          I don't think we should move it out of the analysis package for now (maybe i shouldn't have put it in util even in the patch) unless there's something else that actually wants to use it: I think this would be premature.

          Show
          Robert Muir added a comment - Ryan: maybe, I thought of this too myself looking at the patch. Then again there are probably other kinds of refactoring improvements we could make... honestly I didn't dig deep enough into this one to see if it can be solved just by 'add Appendable interface to CharsRef' or to even think if thats the right thing to do. I don't think we should move it out of the analysis package for now (maybe i shouldn't have put it in util even in the patch) unless there's something else that actually wants to use it: I think this would be premature.
          Hide
          Robert Muir added a comment -

          its also worth mentioning if the class is just used for appending (not sure if it is), we might be able to just append to the CharTermAttribute directly instead, it implements Appendable already.

          Show
          Robert Muir added a comment - its also worth mentioning if the class is just used for appending (not sure if it is), we might be able to just append to the CharTermAttribute directly instead, it implements Appendable already.
          Hide
          Ryan McKinley added a comment -

          great.

          Another thing that jumps out is

              CharArrayMap<DictEntry> d = new CharArrayMap<DictEntry>(
                  Version.LUCENE_31, 1000, false);
          

          Looks like we need to refactor:

          private static final CharArrayMap<DictEntry> dict_ht = initializeDictHash();
          

          so it can be passed the Lucene Version. I'm not sure we need it to be static either...

          I can take a look at that if you are not already on it

          Show
          Ryan McKinley added a comment - great. Another thing that jumps out is CharArrayMap<DictEntry> d = new CharArrayMap<DictEntry>( Version.LUCENE_31, 1000, false ); Looks like we need to refactor: private static final CharArrayMap<DictEntry> dict_ht = initializeDictHash(); so it can be passed the Lucene Version. I'm not sure we need it to be static either... I can take a look at that if you are not already on it
          Hide
          Robert Muir added a comment -

          Personally I don't think we should do this: if you look thru the analyzers you will find other similar code.

          Plus, there's no need to support the 'broken' pre-3.1 behavior here, since this thing isn't planned to be released until 3.3.

          Show
          Robert Muir added a comment - Personally I don't think we should do this: if you look thru the analyzers you will find other similar code. Plus, there's no need to support the 'broken' pre-3.1 behavior here, since this thing isn't planned to be released until 3.3.
          Hide
          Ryan McKinley added a comment -

          ok – just making sure there is nothing lost with Version.LUCENE_40

          +1 to commit

          Show
          Ryan McKinley added a comment - ok – just making sure there is nothing lost with Version.LUCENE_40 +1 to commit
          Hide
          Robert Muir added a comment -

          No, nothing will be lost... and actually since 'false' is passed here for ignoreCase, the constant does nothing... just looks wierd.

          Show
          Robert Muir added a comment - No, nothing will be lost... and actually since 'false' is passed here for ignoreCase, the constant does nothing... just looks wierd.
          Hide
          Robert Muir added a comment -

          Thanks for reviewing Ryan... i found some @authors just doing another scan, i'll nuke those before committing.

          Show
          Robert Muir added a comment - Thanks for reviewing Ryan... i found some @authors just doing another scan, i'll nuke those before committing.
          Hide
          Robert Muir added a comment -

          Committed revision 1130527 (trunk), 1130532 (branch_3x).

          Thanks Yonik!

          Show
          Robert Muir added a comment - Committed revision 1130527 (trunk), 1130532 (branch_3x). Thanks Yonik!
          Hide
          Ryan McKinley added a comment -

          wow, closing a ticket from 2003!

          Thanks Robert, Yonik, etc

          Show
          Ryan McKinley added a comment - wow, closing a ticket from 2003! Thanks Robert, Yonik, etc
          Hide
          Yonik Seeley added a comment -

          very minor optimization to avoid a char[] allocation per stemmed word.

          Show
          Yonik Seeley added a comment - very minor optimization to avoid a char[] allocation per stemmed word.
          Hide
          Robert Muir added a comment -

          why create strings either?

          Show
          Robert Muir added a comment - why create strings either?
          Hide
          Yonik Seeley added a comment -

          why create strings either?

          Good point. I assume you mean something like this patch?

          Show
          Yonik Seeley added a comment - why create strings either? Good point. I assume you mean something like this patch?
          Hide
          Robert Muir added a comment -

          it looks good... i think its the same as the patch i uploaded (_alt.patch)... only i used the .append syntactic sugar

          Show
          Robert Muir added a comment - it looks good... i think its the same as the patch i uploaded (_alt.patch)... only i used the .append syntactic sugar
          Hide
          Yonik Seeley added a comment -

          i think its the same as the patch i uploaded

          D'oh! I hate that the "All" tab in JIRA isn't selected by default (and hence one doesn't see stuff like file uploads

          Show
          Yonik Seeley added a comment - i think its the same as the patch i uploaded D'oh! I hate that the "All" tab in JIRA isn't selected by default (and hence one doesn't see stuff like file uploads
          Hide
          Robert Muir added a comment -

          bulk close for 3.3

          Show
          Robert Muir added a comment - bulk close for 3.3
          Hide
          ofer fort added a comment -

          sorry for the reopen, but why is the constructor of KStemmer not public?

          Show
          ofer fort added a comment - sorry for the reopen, but why is the constructor of KStemmer not public?

            People

            • Assignee:
              Robert Muir
              Reporter:
              Otis Gospodnetic
            • Votes:
              9 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development