Lucene - Core / LUCENE-6254

Dictionary-based lemmatizer

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • 5.0
    • modules/analysis
    • New

    Description

      Today, the only way to achieve lemmatization is to use the SynonymFilterFactory. The available stemmers are also inaccurate because they follow only simplistic rules.

      A dictionary-based lemmatizer can be more precise because it can take the part of speech into account. It therefore reduces words to their base forms more accurately than other dictionary-based stemmers such as Hunspell.
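
      To illustrate why part of speech matters, here is a minimal sketch in plain Java. It is not the code from the attached patch and does not use any Lucene API; the class name DictionaryLemmatizerSketch, the "form|POS" key layout, the tag names and the sample entries are all assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not the lemmatizer from LUCENE-6254.patch.
public class DictionaryLemmatizerSketch {

    // Maps "lowercased surface form|part of speech" to a lemma.
    private final Map<String, String> lemmas = new HashMap<>();

    // Registers one dictionary entry.
    public void add(String form, String pos, String lemma) {
        lemmas.put(key(form, pos), lemma);
    }

    // Returns the lemma for the given form and part of speech,
    // or the form unchanged when the dictionary has no entry.
    public String lemmatize(String form, String pos) {
        return lemmas.getOrDefault(key(form, pos), form);
    }

    private static String key(String form, String pos) {
        return form.toLowerCase(Locale.ROOT) + "|" + pos;
    }

    public static void main(String[] args) {
        DictionaryLemmatizerSketch dict = new DictionaryLemmatizerSketch();
        // Assumed sample entries; a real dictionary would be loaded from a file.
        dict.add("saw", "VERB", "see");
        dict.add("saw", "NOUN", "saw");
        dict.add("better", "ADJ", "good");

        System.out.println(dict.lemmatize("saw", "VERB"));   // see
        System.out.println(dict.lemmatize("saw", "NOUN"));   // saw
        System.out.println(dict.lemmatize("better", "ADJ")); // good
        System.out.println(dict.lemmatize("lucene", "NOUN")); // lucene (no entry)
    }
}
```

      The same surface form "saw" resolves to different lemmas depending on its tag, which a purely rule-based stemmer cannot distinguish. How the part of speech is obtained is left open in this sketch.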

      This is my effort to develop such a lemmatizer for Apache Lucene. The documentation is temporarily placed here:
      http://folk.uio.no/erlendfg/solr/lemmatizer.html

      Attachments

        1. LUCENE-6254.patch
          32 kB
          Erlend Garåsen

        Activity

          People

            Assignee: Unassigned
            Reporter: Erlend Garåsen (erlendfg)
            Votes: 0
            Watchers: 4
