[OPENNLP-1353] DictionaryLemmatizer missing charset - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.9.3
Fix Version/s: 2.0.0
Component/s: Lemmatizer
Labels: None
Environment: Windows 10
Language: Java

Description

The DictionaryLemmatizer does not decode its input stream correctly during initialization because no charset is specified.

My dictionary file for the lemmatizer is UTF-8 encoded. During DictionaryLemmatizer initialization, no charset is passed to the InputStreamReader, so the platform default encoding is used, which is windows-1252 in my case. As a result, the correct lemmas are not found for words that contain non-ASCII characters.
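The mechanism described above can be sketched with plain JDK classes. This is only an illustration, not the OpenNLP code or its actual fix: the class name `CharsetAwareDictReader` and the helper `readLines` are hypothetical. The point is that `new InputStreamReader(in)` uses the platform default charset, while the two-argument constructor pins the decoding to a charset chosen by the caller.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CharsetAwareDictReader {

    // Hypothetical helper: reads every line of the stream, decoding the bytes
    // with the charset supplied by the caller instead of the platform default.
    public static List<String> readLines(InputStream in, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The charset an InputStreamReader falls back to when none is given.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // "m\u00e4use" is the German word from the report, written with a
        // Unicode escape so the source file encoding does not matter.
        byte[] utf8Bytes = "m\u00e4use\tNN\tmaus\n".getBytes(StandardCharsets.UTF_8);

        List<String> lines = readLines(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        System.out.println(lines.get(0).startsWith("m\u00e4use"));
    }
}
```

With the charset stated explicitly, the result no longer depends on the machine the code runs on. Since JDK 18 the platform default is UTF-8 (JEP 400), so the behavior in this report is most likely to appear on older JDKs where the default follows the operating system locale.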

For example, my lemma.dict file contains the following line (UTF-8):

mäuse      NN     maus   //German word for mice

The InputStreamReader then decodes it as windows-1252:

mÃ¤use    NN    maus
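The garbled entry can be reproduced without OpenNLP. The following is a minimal sketch, assuming the JDK in use provides the `windows-1252` charset; the class `MojibakeDemo` and the method `misdecode` are hypothetical names used only for illustration. It encodes the word as UTF-8 and decodes the same bytes as windows-1252, which is what happens when the reader falls back to that default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Hypothetical helper: encodes text with the charset the file really uses,
    // then decodes the bytes with the charset the reader wrongly assumes.
    public static String misdecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String word = "m\u00e4use"; // the dictionary key, 5 characters

        String garbled = misdecode(word, StandardCharsets.UTF_8, cp1252);

        // The two UTF-8 bytes of U+00E4 (0xC3 0xA4) become two separate characters.
        System.out.println(garbled.equals("m\u00c3\u00a4use")); // prints "true"
        System.out.println(garbled.length());                   // prints "6"

        // A lookup with the correctly spelled word can no longer match the key.
        System.out.println(garbled.equals(word));               // prints "false"
    }
}
```

Because the key stored in memory is the six-character garbled string, a lookup with the correctly spelled five-character word finds nothing, which matches the missing lemmas described in the report.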

People

Assignee: Unassigned
Reporter: Robert
Votes: 0
Watchers: 2

Dates

Created: 17/Jan/22 15:35
Updated: 07/Jun/22 11:58
Resolved: 07/Jun/22 11:58

Time Tracking

Not Specified