Solr / SOLR-3231

Add the ability to KStemmer to preserve the original token when stemming

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 5.0
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      While using the PorterStemmer, I found that it was often far too aggressive in its stemming. In my particular case it is unrealistic to provide a protected-word list that captures every word which should not be stemmed. To avoid this I propose a solution whereby we store the original token as well as the stemmed token, so exact searches will always work. Based on discussions on the mailing list with Ahmet Arslan, I believe the attached patch to KStemmer provides the desired capability through a configuration parameter. It is largely a copy of org.apache.lucene.wordnet.SynonymTokenFilter.
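The core idea of the patch can be sketched without Lucene's API: for each input token, emit the original term, then the stem at the same position (position increment 0) when it differs. This is a minimal, self-contained illustration, not the actual patch; the `Stemmer` interface and the naive trailing-"s" rule are placeholders standing in for KStem.

```java
import java.util.ArrayList;
import java.util.List;

public class PreserveOriginalSketch {
    /** A token plus its position increment, mirroring Lucene's attribute model. */
    record Token(String term, int posInc) {}

    /** Placeholder for the real stemmer (KStem in the patch). */
    interface Stemmer { String stem(String term); }

    static List<Token> analyze(String[] terms, Stemmer stemmer) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            out.add(new Token(term, 1));         // original token advances the position
            String stemmed = stemmer.stem(term);
            if (!stemmed.equals(term)) {
                out.add(new Token(stemmed, 0));  // stem stacked at the same position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Naive illustrative stemmer: strip a trailing "s".
        Stemmer naive = t -> t.endsWith("s") ? t.substring(0, t.length() - 1) : t;
        for (Token t : analyze(new String[] {"red", "bricks"}, naive)) {
            System.out.println(t.term() + " (posInc=" + t.posInc() + ")");
        }
    }
}
```

Because the stem sits at position increment 0, a phrase or exact query for "bricks" still matches, while a query for "brick" matches too, which is exactly the behavior the description asks for.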

      1. KStemFilter.patch
        4 kB
        Jamie Johnson

          Activity

          Ryan McKinley added a comment -

          If I understand the patch, this patch just sets the tokenType attribute to "STEM" right?

          Jamie Johnson added a comment -

          This should (unless I messed it up, which is possible) also produce a token for the original term. For instance, if the term was "bricks" it should produce tokens for both "bricks" and "brick". If that's not the case please let me know.

          TestKStemFilterFactory.java
           /**
            * Licensed to the Apache Software Foundation (ASF) under one or more
            * contributor license agreements.  See the NOTICE file distributed with
            * this work for additional information regarding copyright ownership.
            * The ASF licenses this file to You under the Apache License, Version 2.0
            * (the "License"); you may not use this file except in compliance with
            * the License.  You may obtain a copy of the License at
            *
            *     http://www.apache.org/licenses/LICENSE-2.0
            *
            * Unless required by applicable law or agreed to in writing, software
            * distributed under the License is distributed on an "AS IS" BASIS,
            * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
            * See the License for the specific language governing permissions and
            * limitations under the License.
            */
           package org.apache.solr.analysis;

           import java.io.Reader;
           import java.io.StringReader;

           import org.apache.lucene.analysis.MockTokenizer;
           import org.apache.lucene.analysis.TokenStream;

           /**
            * Simple tests to ensure the kstem filter factory is working.
            */
           public class TestKStemFilterFactory extends BaseTokenTestCase {
             public void testStemming() throws Exception {
               Reader reader = new StringReader("bricks");
               KStemFilterFactory factory = new KStemFilterFactory();
               TokenStream stream = factory.create(new MockTokenizer(reader, MockTokenizer.WHITESPACE, false));
               // Expect the original token and its stem, the stem at the same
               // position (position increments 1 and 0).
               assertTokenStreamContents(stream, new String[] { "bricks", "brick" }, new int[] { 1, 0 });
             }
           }

          That is what this tests, right?

          Robert Muir added a comment -

          I don't think we should approach the problem this way: this is the same discussion as LUCENE-3415.

          Jamie Johnson added a comment -

          Thanks Robert. I just read LUCENE-3415 and understand the approach. My biggest issue is that I don't like having to create a separate field to do an exact search; this, of course, is based on the fact that I was burned by this, so perhaps I am biased. It feels like the right thing for a user of the API is to do the least destructive thing, but again I have a specific use case in mind and am not considering all other implications.
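For reference, the separate-field alternative Robert's pointer to LUCENE-3415 favors looks roughly like the following hypothetical schema.xml fragment; the field and field-type names here are illustrative, not from the patch.

```xml
<!-- Hypothetical schema.xml sketch: copy the raw text into a second
     field whose analyzer omits the stemmer, so exact-match queries
     can target text_exact while stemmed queries use text. -->
<field name="text"       type="text_kstem"     indexed="true" stored="true"/>
<field name="text_exact" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="text" dest="text_exact"/>
```

The trade-off under discussion: this doubles the indexed text for the field, whereas the patch keeps a single field at the cost of stacked tokens.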

          Uwe Schindler added a comment -

          Closed after release.


            People

            • Assignee: Unassigned
            • Reporter: Jamie Johnson
            • Votes: 0
            • Watchers: 3
