Uploaded image for project: 'OpenNLP'
  1. OpenNLP
  2. OPENNLP-1353

DictonaryLemmatizer missing charset

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.9.3
    • 2.0.0
    • Lemmatizer
    • None
    • Windows 10

    Description

      The initialization of the DictonaryLemmatizer is not decoding the inputstream correctly due to missing charset.

      My dictionary file for the lemmatizer is utf-8 encoded. At DictonaryLemmatizer initialization the system fallback encoding is used because no charset is specified for the InputStreamReader. In my case windows-1252. This leads to the problem that the correct lemmas of words are not found.

      E.g. My lemma.dict file contains following line (utf-8):

      mäuse      NN     maus   //German word of mice
      

      And the InputStreamReader decodes it as windows-1252:

      mäuse    NN    maus
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            Wenig Robert
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified