  1. Lucene - Core
  2. LUCENE-8937

Avoid aggressive stemming on numbers in the FrenchMinimalStemmer

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: main (9.0)
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The discussion on the mailing list is here: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser

      The light stemmer removes the last character of a word if the last two
      characters are identical.
      We can see that here:
      https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
      In this light stemmer, there is a check to avoid altering the token if the
      token is a number.

      The minimal stemmer also removes the last character of a word if the last
      two characters are identical.
      We can see that here:
      https://github.com/apache/lucene-solr/blob/813ca77/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

      The minimal stemmer, however, does not check whether that character is a
      letter.
      As a result, numeric tokens whose last two characters are identical are
      altered.

      For example "1234567899" will be stemmed as "123456789".

      It would be great if such tokens were not altered.
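
The behaviour described above can be illustrated with a simplified sketch. This is not the Lucene source: the class `DoubledCharSketch` and the methods `stripDoubledLast` and `apply` are hypothetical names, and only the doubled-last-character rule is modelled (the real stemmer applies other suffix rules as well). The `lettersOnly` flag stands in for the kind of letter check the issue asks for, assumed here to be `Character.isLetter`.

```java
// Simplified sketch of the doubled-last-character rule; not the Lucene source.
final class DoubledCharSketch {

  // Hypothetical helper: returns the new token length after dropping the last
  // character when it repeats the one before it. When lettersOnly is true,
  // the character is dropped only if it is a letter, so digits are kept.
  static int stripDoubledLast(char[] s, int len, boolean lettersOnly) {
    if (len < 2) {
      return len;
    }
    boolean doubled = s[len - 1] == s[len - 2];
    boolean allowed = !lettersOnly || Character.isLetter(s[len - 1]);
    return (doubled && allowed) ? len - 1 : len;
  }

  // Convenience wrapper that works on strings instead of char buffers.
  static String apply(String token, boolean lettersOnly) {
    char[] chars = token.toCharArray();
    return new String(chars, 0, stripDoubledLast(chars, chars.length, lettersOnly));
  }

  public static void main(String[] args) {
    System.out.println(apply("1234567899", false)); // without the letter check
    System.out.println(apply("1234567899", true));  // with the letter check
  }
}
```

Without the letter check the numeric token loses its final digit; with the check it is returned unchanged, while a token ending in a doubled letter is still shortened.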

      The same issue was reported for the LightStemmer: https://issues.apache.org/jira/browse/LUCENE-4063

        Attachments

        1. 0001-LUCENE-8937-Avoid-agressive-stemming-on-numbers-in-t.patch
          3 kB
          Adrien Gallou
        2. LUCENE-8937.patch
          0.1 kB
          Adrien Gallou

            People

            • Assignee: Tomoko Uchida (tomoko)
            • Reporter: Adrien Gallou (agallou)
            • Votes: 0
            • Watchers: 3
