  Lucene - Core / LUCENE-6254

Dictionary-based lemmatizer

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields: New

      Description

      Today the only way to achieve lemmatization is to use SynonymFilterFactory. The available stemmers are also inaccurate because they follow only simplistic rules.

      A dictionary-based lemmatizer can be more precise because it can take the part of speech into account. It is therefore a more accurate way to reduce words to their base form than other dictionary-based stemmers such as Hunspell.
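
To illustrate why part of speech matters, here is a minimal sketch of a lookup keyed on both the surface form and its part of speech. It is not taken from the attached patch: the class `PosLemmaDictionary`, the `Pos` enum, and the sample entries are all hypothetical, and how the part of speech is obtained for a token is left out entirely.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: a lemma dictionary keyed on surface form plus part of speech. */
public class PosLemmaDictionary {

    /** Coarse part-of-speech tags (an assumption; a real dictionary may use a richer tag set). */
    public enum Pos { NOUN, VERB, ADJECTIVE }

    // Maps "lowercased form + TAB + POS" to the lemma.
    private final Map<String, String> lemmas = new HashMap<>();

    private static String key(String form, Pos pos) {
        return form.toLowerCase(Locale.ROOT) + "\t" + pos.name();
    }

    /** Registers one dictionary entry. */
    public void add(String form, Pos pos, String lemma) {
        lemmas.put(key(form, pos), lemma);
    }

    /** Returns the lemma for the given form and POS, or the form unchanged if it is unknown. */
    public String lemmatize(String form, Pos pos) {
        return lemmas.getOrDefault(key(form, pos), form);
    }

    public static void main(String[] args) {
        PosLemmaDictionary dict = new PosLemmaDictionary();
        dict.add("saw", Pos.VERB, "see");
        dict.add("saw", Pos.NOUN, "saw");
        dict.add("meeting", Pos.VERB, "meet");
        dict.add("meeting", Pos.NOUN, "meeting");

        // The same surface form resolves to different lemmas depending on POS.
        System.out.println(dict.lemmatize("saw", Pos.VERB));     // see
        System.out.println(dict.lemmatize("saw", Pos.NOUN));     // saw
        System.out.println(dict.lemmatize("meeting", Pos.VERB)); // meet
    }
}
```

A rule-based stemmer sees only the string "saw" and cannot tell the two readings apart, which is the gap a part-of-speech-aware dictionary is meant to close.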

      This is my effort to develop such a lemmatizer for Apache Lucene. The documentation is temporarily placed here:
      http://folk.uio.no/erlendfg/solr/lemmatizer.html

        Attachments

        1. LUCENE-6254.patch
          32 kB
          Erlend Garåsen

            People

            • Assignee: Unassigned
            • Reporter: Erlend Garåsen (erlendfg)
            • Votes: 0
            • Watchers: 4

              Dates

              • Created:
              • Updated: