  1. Lucene - Core
  2. LUCENE-10248

Add SpanishPluralStemFilter

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 9.0
    • 9.1
    • modules/analysis
    • None
    • New

    Description

      We propose a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. Our goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

      In the following article we compare different Spanish stemmers and use cases, and describe the value our contribution adds.

      Our solution is an algorithmic approach based on the Spanish rules for building plural forms,
      as defined in Wikilengua.

      Some characteristics:

      • Designed to stem just plural to singular form
      • Distinguishes between masculine and feminine forms
      • It will increase recall but precision can be reduced depending on the use case/information need
      • Stems plural words of foreign origin, e.g. complots, bits, punks, robots
      • Support for invariant words (same plural and singular form, or words for which a plural does not make sense), e.g. crisis, jueves, lapsus, abrebotellas, etc.
      • Support for special cases, e.g. yoes, clubes, itemes, faralaes
      • Use it when the distinction between singular and plural is not relevant but gender is relevant
      • Produces meaningful tokens in singular form
        • No strange stems like “amig”: it’s true that stemmers need not generate grammatically correct tokens, but generating correct singular forms decreases the possibility of collisions with other words
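
      The characteristics above can be illustrated with a deliberately small sketch. This is not the code of the SpanishPluralStemmer: the class name PluralStemSketch, its word lists, and its three rules are assumptions made up for illustration, and accents are ignored. It only shows the general shape of the approach: check invariant words first, then special cases, then apply suffix rules that leave the gender-bearing vowel in place.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy stemmer; NOT the Lucene SpanishPluralStemmer implementation.
public class PluralStemSketch {
    // Invariant words (examples taken from the issue text): returned unchanged.
    private static final Set<String> INVARIANT =
        Set.of("crisis", "jueves", "lapsus", "abrebotellas");

    // Special cases that the generic suffix rules below would get wrong.
    private static final Map<String, String> SPECIAL =
        Map.of("yoes", "yo", "clubes", "club");

    private static boolean isVowel(char c) {
        // Accented vowels are ignored in this sketch.
        return "aeiou".indexOf(c) >= 0;
    }

    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (INVARIANT.contains(w)) {
            return w;
        }
        String special = SPECIAL.get(w);
        if (special != null) {
            return special;
        }
        int n = w.length();
        // Assumed rule: vowel + (d|l|n|r) + "es" -> drop "es" (flores -> flor).
        if (n > 4 && w.endsWith("es")
            && "dlnr".indexOf(w.charAt(n - 3)) >= 0
            && isVowel(w.charAt(n - 4))) {
            return w.substring(0, n - 2);
        }
        // Assumed rule: any other final "s" is dropped, which keeps the
        // gender vowel (amigos -> amigo, amigas -> amiga) and also covers
        // foreign-origin plurals such as robots -> robot.
        if (n > 2 && w.endsWith("s")) {
            return w.substring(0, n - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] samples = {"amigos", "amigas", "robots", "crisis", "clubes", "flores"};
        for (String s : samples) {
            System.out.println(s + " -> " + stem(s));
        }
    }
}
```

      Because only the plural marker is removed, "amigos" and "amigas" map to two different singular tokens rather than a shared stem such as “amig”. The sketch mishandles many real words (for example, it would turn "cines" into "cin"), which is why a usable stemmer needs larger word lists and finer rules.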

            People

              Assignee: Unassigned
              Reporter: Xavier Sanchez Loro (xavier.sanchez)