Lucene - Core
  1. Lucene - Core
  2. LUCENE-2055

Fix buggy stemmers and Remove duplicate analysis functionality

    Details

    • Lucene Fields:
      New

      Description

      would like to remove stemmers in the following packages, and instead in their analyzers use a SnowballStemFilter instead.

      • analyzers/fr
      • analyzers/nl
      • analyzers/ru

      below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
      I think we should delete all of this custom stemming code in favor of the actual snowball package.

      /**
       * A stemmer for French words. 
       * <p>
       * The algorithm is based on the work of
       * Dr Martin Porter on his snowball project<br>
       * refer to http://snowball.sourceforge.net/french/stemmer.html<br>
       * (French stemming algorithm) for details
       * </p>
       */
      
      public class FrenchStemmer {
      
      /**
       * A stemmer for Dutch words. 
       * <p>
       * The algorithm is an implementation of
       * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
       * algorithm in Martin Porter's snowball project.
       * </p>
       */
      public class DutchStemmer {
      
      /**
       * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
       */
      class RussianStemmer
      
      1. LUCENE-2055.patch
        163 kB
        Robert Muir
      2. LUCENE-2055.patch
        163 kB
        Robert Muir
      3. LUCENE-2055.patch
        169 kB
        Robert Muir
      4. LUCENE-2055.patch
        168 kB
        Robert Muir
      5. LUCENE-2055.patch
        179 kB
        Robert Muir

        Issue Links

          Activity

          Robert Muir created issue -
          Hide
          Mark Miller added a comment -

          +1 - lets lose it.

          Show
          Mark Miller added a comment - +1 - lets lose it.
          Hide
          Robert Muir added a comment -

          One thing I would like to fix: these packages have stoplists, but our snowball implementation is missing the stoplists from the snowball dist.

          These are provided as .txt files in the full snowball distribution, so I think it would be an easy improvement to the snowball pkg to make these available somehow.

          Show
          Robert Muir added a comment - One thing I would like to fix: these packages have stoplists, but our snowball implementation is missing the stoplists from the snowball dist. These are provided as .txt files in the full snowball distribution, so I think it would be an easy improvement to the snowball pkg to make these available somehow.
          Hide
          Robert Muir added a comment -

          setting to 3.1, because I would like to make use of simon's stopword-handling improvements to tie in the snowball stoplists.

          this way we are not taking any functionality away.

          Show
          Robert Muir added a comment - setting to 3.1, because I would like to make use of simon's stopword-handling improvements to tie in the snowball stoplists. this way we are not taking any functionality away.
          Robert Muir made changes -
          Field Original Value New Value
          Fix Version/s 3.1 [ 12314025 ]
          Robert Muir made changes -
          Link This issue depends on LUCENE-2206 [ LUCENE-2206 ]
          Hide
          Robert Muir added a comment -

          Now that we have snowball tests, I started looking at integrating snowball and deprecating this custom code.
          So I ran the snowball tests against these hand-coded algorithms to see what the differences are... remember they all claim to implement porter:

          • RussianStemFilter one passes 100% all snowball tests.
          • DutchStemFilter passes 98.9% of snowball tests. all bugs were in handling of double consonants:
            examples:
            aangetroffen -> aangetrof expected: aangetroff
            afvoerbonnen -> afvoerbon expected: afvoerbonn
            klommen -> klom expected: klomm
          • FrenchStemFilter only passes 93.92% of snowball tests. but if you put lowercasefilter after it, it passes 99.66%!
            The problem is this stemmer incorrectly creates some uppercase stems from lowercase words. examples:
            xviii -> xviI expected: xvii
            vouer -> voU expected: vou
            tranquille -> tranqUill expected: tranquill
          Show
          Robert Muir added a comment - Now that we have snowball tests, I started looking at integrating snowball and deprecating this custom code. So I ran the snowball tests against these hand-coded algorithms to see what the differences are... remember they all claim to implement porter: RussianStemFilter one passes 100% all snowball tests. DutchStemFilter passes 98.9% of snowball tests. all bugs were in handling of double consonants: examples: aangetroffen -> aangetrof expected: aangetroff afvoerbonnen -> afvoerbon expected: afvoerbonn klommen -> klom expected: klomm FrenchStemFilter only passes 93.92% of snowball tests. but if you put lowercasefilter after it, it passes 99.66%! The problem is this stemmer incorrectly creates some uppercase stems from lowercase words. examples: xviii -> xviI expected: xvii vouer -> voU expected: vou tranquille -> tranqUill expected: tranquill
          Robert Muir made changes -
          Link This issue depends on LUCENE-2198 [ LUCENE-2198 ]
          Hide
          Robert Muir added a comment -

          linking to LUCENE-2198, when we replace some of these buggy stemfilters with snowball ones, we will need it to implement the protected word support already exposed by the relevant analyzers.

          Show
          Robert Muir added a comment - linking to LUCENE-2198 , when we replace some of these buggy stemfilters with snowball ones, we will need it to implement the protected word support already exposed by the relevant analyzers.
          Robert Muir made changes -
          Summary Remove duplicate analysis functionality Fix buggy stemmers and Remove duplicate analysis functionality
          Issue Type Task [ 3 ] Bug [ 1 ]
          Description would like to mark the following code deprecated, so it can be removed.

          * analyzers/fr: all except ElisionFilter, this is unrelated and standalone.
          * analyzers/nl:entire package
          * analyzers/ru: entire package

          below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
          I think we should delete all of this code in favor of the actual snowball package.


          {noformat}
          /**
           * A stemmer for French words.
           * <p>
           * The algorithm is based on the work of
           * Dr Martin Porter on his snowball project<br>
           * refer to http://snowball.sourceforge.net/french/stemmer.html&lt;br>
           * (French stemming algorithm) for details
           * </p>
           */

          public class FrenchStemmer {

          /**
           * A stemmer for Dutch words.
           * <p>
           * The algorithm is an implementation of
           * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
           * algorithm in Martin Porter's snowball project.
           * </p>
           */
          public class DutchStemmer {

          /**
           * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
           */
          class RussianStemmer
          {noformat}

          would like to remove stemmers in the following packages, and instead in their analyzers use a SnowballStemFilter instead.

          * analyzers/fr
          * analyzers/nl
          * analyzers/ru

          below are excerpts from this code where they proudly proclaim they use the snowball algorithm.
          I think we should delete all of this custom stemming code in favor of the actual snowball package.


          {noformat}
          /**
           * A stemmer for French words.
           * <p>
           * The algorithm is based on the work of
           * Dr Martin Porter on his snowball project<br>
           * refer to http://snowball.sourceforge.net/french/stemmer.html&lt;br>
           * (French stemming algorithm) for details
           * </p>
           */

          public class FrenchStemmer {

          /**
           * A stemmer for Dutch words.
           * <p>
           * The algorithm is an implementation of
           * the <a href="http://snowball.tartarus.org/algorithms/dutch/stemmer.html">dutch stemming</a>
           * algorithm in Martin Porter's snowball project.
           * </p>
           */
          public class DutchStemmer {

          /**
           * Russian stemming algorithm implementation (see http://snowball.sourceforge.net for detailed description).
           */
          class RussianStemmer
          {noformat}

          Hide
          DM Smith added a comment -

          I think it is right to fix bad behavior, but such a change is not bw compat. It will require an index rebuild.

          I'm happy with the direction that non-english work is going. I'm hoping that once it is solid that the bw compat policy will be strict as core. (whatever that means For any application of Lucene that handles many different languages, it is critical that this "stuff" is stable and solid.

          Show
          DM Smith added a comment - I think it is right to fix bad behavior, but such a change is not bw compat. It will require an index rebuild. I'm happy with the direction that non-english work is going. I'm hoping that once it is solid that the bw compat policy will be strict as core. (whatever that means For any application of Lucene that handles many different languages, it is critical that this "stuff" is stable and solid.
          Hide
          Robert Muir added a comment -

          Hello DM, me marking this as a bug does not mean it will be a backwards incompatible fix, i have not even proposed a patch yet.

          This is undeniably a bug, each stemmer proudly lists that it implements the snowball algorithm, but it is not correct.
          it is my understanding that such problems (buggy stemming impls) are the reason the snowball project was created in the first place

          So, we can fix the bug in 2 different ways:

          • delete the old stemmers and in the analyzers replace them with SnowballStemFilters (it does fix the bug, as they now become correct)
          • keep the buggy code and behavior with version
          Show
          Robert Muir added a comment - Hello DM, me marking this as a bug does not mean it will be a backwards incompatible fix, i have not even proposed a patch yet. This is undeniably a bug, each stemmer proudly lists that it implements the snowball algorithm, but it is not correct. it is my understanding that such problems (buggy stemming impls) are the reason the snowball project was created in the first place So, we can fix the bug in 2 different ways: delete the old stemmers and in the analyzers replace them with SnowballStemFilters (it does fix the bug, as they now become correct) keep the buggy code and behavior with version
          Hide
          Robert Muir added a comment -

          I also wanted to comment here regarding further duplicate analysis, we can look at these in a later issue if we want

          • BrazilianStemmer looks suspiciously like the Snowball Portuguese algorithm except with different diacritics handling, need to look further
          • ChineseAnalyzer (the one that does individual chinese characters) does essentially what StandardAnalyzer does with chinese text, I do not see any other features
          Show
          Robert Muir added a comment - I also wanted to comment here regarding further duplicate analysis, we can look at these in a later issue if we want BrazilianStemmer looks suspiciously like the Snowball Portuguese algorithm except with different diacritics handling, need to look further ChineseAnalyzer (the one that does individual chinese characters) does essentially what StandardAnalyzer does with chinese text, I do not see any other features
          Hide
          Robert Muir added a comment -

          i'd like to work on getting these bugs fixed, but I'm not sure the best way to proceed.

          looking at the different possibilities i came up with two good options, although maybe there are other ways:

          • option 1, deprecate and keep the old broken impls and apis, but depending on Version use the correct ones instead: api and index back compat, but we keep the buggy code and support it for at least some time.
          • option 2, deprecate the old apis, but implement it in terms of the correct one: api back compat only, but we drop the buggy code so maintenance is easier
          Show
          Robert Muir added a comment - i'd like to work on getting these bugs fixed, but I'm not sure the best way to proceed. looking at the different possibilities i came up with two good options, although maybe there are other ways: option 1, deprecate and keep the old broken impls and apis, but depending on Version use the correct ones instead: api and index back compat, but we keep the buggy code and support it for at least some time. option 2, deprecate the old apis, but implement it in terms of the correct one: api back compat only, but we drop the buggy code so maintenance is easier
          Hide
          Robert Muir added a comment -

          apologies for the large patch.

          this patch does the following:

          • deprecates RussianTokenizer, RussianStemmer, RussianStemFilter, DutchStemmer, DutchStemFilter, FrenchStemmer, FrenchStemFilter
          • use snowball in the above analyzers instead, depending upon version.
          • doesn't deprecate germanstemmer, but uses snowball instead (which is maintained and relevance-tested and supports things like u+umlaut = ue, etc). the old stemmer is kept because it is a different algorithm (alternate).
          • the dutchstemmer had 'dictionary based stemming override' support, so to implement this, add StemmerOverrideFilter which does this in a generic way with KeywordAttribute
          • adds KeywordAttribute support to SnowballFilter
          • deprecates SnowballAnalyzer in favor of language-specific analyzers.
          • adds Romanian and Turkish stopword lists, since snowball is missing them.
          • implements language-specific analyzers in place of all the ones snowball tried to do at once before.
          Show
          Robert Muir added a comment - apologies for the large patch. this patch does the following: deprecates RussianTokenizer, RussianStemmer, RussianStemFilter, DutchStemmer, DutchStemFilter, FrenchStemmer, FrenchStemFilter use snowball in the above analyzers instead, depending upon version. doesn't deprecate germanstemmer, but uses snowball instead (which is maintained and relevance-tested and supports things like u+umlaut = ue, etc). the old stemmer is kept because it is a different algorithm (alternate). the dutchstemmer had 'dictionary based stemming override' support, so to implement this, add StemmerOverrideFilter which does this in a generic way with KeywordAttribute adds KeywordAttribute support to SnowballFilter deprecates SnowballAnalyzer in favor of language-specific analyzers. adds Romanian and Turkish stopword lists, since snowball is missing them. implements language-specific analyzers in place of all the ones snowball tried to do at once before.
          Robert Muir made changes -
          Attachment LUCENE-2055.patch [ 12434371 ]
          Hide
          Robert Muir added a comment -

          here is a short explanation of what i figure might be the controversial part: adding all the language-specific analyzers:

          I think its too difficult for a non-english user to use lucene.
          Let's take the romanian case, sure its supported by SnowballAnalyzer, but:

          • where are the stopwords? if the user is smart enough they can google this and find savoy's list... but it contains some stray nouns that should not be in there, and will they get the encoding correct?
          • for some languages: french, dutch, turkish: we already want to do something different already. For french we need the elision filter to tokenize correctly, for dutch, the special dictionary-based exclusions (I have been told by some any stemmer that does not handle fiets correct is useless), for turkish we need the special lowercasing.
          • for other languages: german, swedish, ... i think we REALLY want to implement decompounding support in the future. For german at least, there is a public domain wordlist just itching to be used for this.
          • oh yeah, and all the javadocs are in english, so writing your own analyzer is another barrier to entry.

          So I think instead its best to have a "recommended default" organized by language, preferably one we have relevance tested / or is already published. many of the existing snowball stemmers have published relevance results available already, thus my bias towards them. Sure it won't meet everyones needs, and users should still think about using them as a template, but I think digging up your own stoplist / writing your own analyzer, figuring out your language support is really buried in snowball, combined with documentation not in your native tongue, i think this adds up to a barrier to entry that is simply too high.

          Show
          Robert Muir added a comment - here is a short explanation of what i figure might be the controversial part: adding all the language-specific analyzers: I think its too difficult for a non-english user to use lucene. Let's take the romanian case, sure its supported by SnowballAnalyzer, but: where are the stopwords? if the user is smart enough they can google this and find savoy's list... but it contains some stray nouns that should not be in there, and will they get the encoding correct? for some languages: french, dutch, turkish: we already want to do something different already. For french we need the elision filter to tokenize correctly, for dutch, the special dictionary-based exclusions (I have been told by some any stemmer that does not handle fiets correct is useless), for turkish we need the special lowercasing. for other languages: german, swedish, ... i think we REALLY want to implement decompounding support in the future. For german at least, there is a public domain wordlist just itching to be used for this. oh yeah, and all the javadocs are in english, so writing your own analyzer is another barrier to entry. So I think instead its best to have a "recommended default" organized by language, preferably one we have relevance tested / or is already published. many of the existing snowball stemmers have published relevance results available already, thus my bias towards them. Sure it won't meet everyones needs, and users should still think about using them as a template, but I think digging up your own stoplist / writing your own analyzer, figuring out your language support is really buried in snowball, combined with documentation not in your native tongue, i think this adds up to a barrier to entry that is simply too high.
          Hide
          Uwe Schindler added a comment -

          Several bugs:

          • StemOverrideFilter should be final or have final incrementToken()
          • The usage of mixed CharArraySet and a conventional dictionary map is buggy. The CAS is using a different contains algo and lowercasing, you can get a hit in the CAS but the Map returns null -> NPE. I would not use a CAS for the beginning and always cast to String for now and I will open an issue for extending CAS to be a CharArrayMap
          Show
          Uwe Schindler added a comment - Several bugs: StemOverrideFilter should be final or have final incrementToken() The usage of mixed CharArraySet and a conventional dictionary map is buggy. The CAS is using a different contains algo and lowercasing, you can get a hit in the CAS but the Map returns null -> NPE. I would not use a CAS for the beginning and always cast to String for now and I will open an issue for extending CAS to be a CharArrayMap
          Hide
          Robert Muir added a comment -

          StemOverrideFilter should be final or have final incrementToken()

          ok, thanks, will fix this now.

          The usage of mixed CharArraySet and a conventional dictionary map is buggy. The CAS is using a different contains algo and lowercasing, you can get a hit in the CAS but the Map returns null -> NPE. I would not use a CAS for the beginning and always cast to String for now and I will open an issue for extending CAS to be a CharArrayMap

          No. Instead i will force it to be case sensitive to ignore this, it is stilly to have dutch stem filter be horribly slow because of some theoretical case like this.

          Show
          Robert Muir added a comment - StemOverrideFilter should be final or have final incrementToken() ok, thanks, will fix this now. The usage of mixed CharArraySet and a conventional dictionary map is buggy. The CAS is using a different contains algo and lowercasing, you can get a hit in the CAS but the Map returns null -> NPE. I would not use a CAS for the beginning and always cast to String for now and I will open an issue for extending CAS to be a CharArrayMap No. Instead i will force it to be case sensitive to ignore this, it is stilly to have dutch stem filter be horribly slow because of some theoretical case like this.
          Hide
          Robert Muir added a comment -

          make StemmerOverrideFilter final, and hardcode it as case-sensitive (it makes sense to come after some sort of lowercasefilter anyway)

          Show
          Robert Muir added a comment - make StemmerOverrideFilter final, and hardcode it as case-sensitive (it makes sense to come after some sort of lowercasefilter anyway)
          Robert Muir made changes -
          Attachment LUCENE-2055.patch [ 12434379 ]
          Hide
          Robert Muir added a comment -

          updated patch:

          • implement StemmerOverrideFilter with CharArrayMap
          • fix casing problems in french and dutch: they did not call lowercasefilter, instead relying upon the stemmer to lowercase things. this causes inconsistencies with stopwords, dictionary-based stemming, exclusion sets, you name it. the old broken behavior is preserved depending on Version
          • add missing standardfilter to greek (depending on Version).
          Show
          Robert Muir added a comment - updated patch: implement StemmerOverrideFilter with CharArrayMap fix casing problems in french and dutch: they did not call lowercasefilter, instead relying upon the stemmer to lowercase things. this causes inconsistencies with stopwords, dictionary-based stemming, exclusion sets, you name it. the old broken behavior is preserved depending on Version add missing standardfilter to greek (depending on Version).
          Robert Muir made changes -
          Attachment LUCENE-2055.patch [ 12434678 ]
          Hide
          Uwe Schindler added a comment -

          Map<?,? extends String> does not make sense as String is final. Map<?,String> and the same for CharArrayMap<String>.

          Show
          Uwe Schindler added a comment - Map<?,? extends String> does not make sense as String is final. Map<?,String> and the same for CharArrayMap<String>.
          Hide
          Robert Muir added a comment -

          updated patch for the generics policeman

          Show
          Robert Muir added a comment - updated patch for the generics policeman
          Robert Muir made changes -
          Attachment LUCENE-2055.patch [ 12434694 ]
          Hide
          Simon Willnauer added a comment -

          Robert, nice work!
          I have one comment on StemmerOverrideFilter

          The ctor should not always copy the given dictionary dictionary - if is created with such a map we should use the given instance. This is similar to StopFilter vs. StopAnalyzer.
          Maybe a CharArrayMap.castOrCopy(Map<?, String>) would be handy in that case.

          One minor thing, the null check in DutchAnalyzer seems to be unnecessary but anyway thats fine.

                 if (stemdict != null && !stemdict.isEmpty())
          

          DutchAnalyzer also has an unused import

          import java.util.Arrays;
          

          except of those +1 from my side

          Show
          Simon Willnauer added a comment - Robert, nice work! I have one comment on StemmerOverrideFilter The ctor should not always copy the given dictionary dictionary - if is created with such a map we should use the given instance. This is similar to StopFilter vs. StopAnalyzer. Maybe a CharArrayMap.castOrCopy(Map<?, String>) would be handy in that case. One minor thing, the null check in DutchAnalyzer seems to be unnecessary but anyway thats fine. if (stemdict != null && !stemdict.isEmpty()) DutchAnalyzer also has an unused import import java.util.Arrays; except of those +1 from my side
          Hide
          Robert Muir added a comment -

          patch addressing Simon's comments, and also fixing javadoc warnings.

          while I am here, remove other unused imports in contrib/analyzers.

          will commit in a day or two if no one objects.

          Show
          Robert Muir added a comment - patch addressing Simon's comments, and also fixing javadoc warnings. while I am here, remove other unused imports in contrib/analyzers. will commit in a day or two if no one objects.
          Robert Muir made changes -
          Attachment LUCENE-2055.patch [ 12434863 ]
          Hide
          Uwe Schindler added a comment -

          I will apply the patch here later and also test everything and look through all analyzers, but as far as I see, I am happy with it. The CharArrayMap code is still ok (if the cast compiles without unchecked warning, not yet checked) - so far from generic police

          +1 also on having separate "default analyzer" classes for each language.

          Show
          Uwe Schindler added a comment - I will apply the patch here later and also test everything and look through all analyzers, but as far as I see, I am happy with it. The CharArrayMap code is still ok (if the cast compiles without unchecked warning, not yet checked) - so far from generic police +1 also on having separate "default analyzer" classes for each language.
          Robert Muir made changes -
          Assignee Robert Muir [ rcmuir ]
          Hide
          Robert Muir added a comment -

          i will commit this monster soon if no one objects.

          Show
          Robert Muir added a comment - i will commit this monster soon if no one objects.
          Hide
          Uwe Schindler added a comment -

          +1, gogogo

          Show
          Uwe Schindler added a comment - +1, gogogo
          rmuir committed 907125 (97 files)
          Reviews: none

          LUCENE-2055: better snowball integration, deprecate buggy handcoded snowball impls, restructure lang support

          Lucene trunk
          Hide
          Robert Muir added a comment -

          Committed revision 907125. Thanks to the reviews and help Simon/Uwe!

          Show
          Robert Muir added a comment - Committed revision 907125. Thanks to the reviews and help Simon/Uwe!
          Robert Muir made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Robert Muir added a comment -

          reopening for possible 2.9.4/3.0.3 backport.

          (i just want to add documentation recommending to use the snowball stemmers instead)

          Show
          Robert Muir added a comment - reopening for possible 2.9.4/3.0.3 backport. (i just want to add documentation recommending to use the snowball stemmers instead)
          Robert Muir made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Robert Muir made changes -
          Fix Version/s 2.9.4 [ 12315148 ]
          Fix Version/s 3.0.3 [ 12315147 ]
          Fix Version/s 3.1 [ 12314822 ]
          rmuir committed 1028782 (28 files)
          Reviews: none

          LUCENE-2055: add documentation notes about buggy stemmers

          Lucene lucene_2_9
          Hide
          Robert Muir added a comment -

          I committed some documentation that if possible, its better to use the snowball versions for Dutch and French.

          Committed revision 1028779 to 3.0.x
          Committed revision 1028782 to 2.9.x

          Show
          Robert Muir added a comment - I committed some documentation that if possible, its better to use the snowball versions for Dutch and French. Committed revision 1028779 to 3.0.x Committed revision 1028782 to 2.9.x
          Robert Muir made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Uwe Schindler made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mark Thomas made changes -
          Workflow jira [ 12481775 ] Default workflow, editable Closed status [ 12564013 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564013 ] jira [ 12585485 ]
          Shai Erera made changes -
          Component/s modules/analysis [ 12310230 ]
          Component/s contrib/analyzers [ 12312333 ]
          Gavin made changes -
          Link This issue depends on LUCENE-2206 [ LUCENE-2206 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-2206 [ LUCENE-2206 ]
          Gavin made changes -
          Link This issue depends on LUCENE-2198 [ LUCENE-2198 ]
          Gavin made changes -
          Link This issue depends upon LUCENE-2198 [ LUCENE-2198 ]

            People

            • Assignee:
              Robert Muir
              Reporter:
              Robert Muir
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development