Details
Bug
Status: Closed
Major
Resolution: Fixed
1.9.3
None
Windows 10
Description
The DictionaryLemmatizer does not decode its input stream correctly during initialization because no charset is specified.
My dictionary file for the lemmatizer is UTF-8 encoded. When the DictionaryLemmatizer is initialized, the InputStreamReader is created without a charset, so the system default encoding is used, which in my case is windows-1252. As a result, every dictionary entry containing non-ASCII characters is decoded incorrectly, and the correct lemmas for those words are not found.
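A minimal sketch of the kind of fix this implies: pass the charset to the InputStreamReader explicitly instead of relying on the platform default. This is not the actual OpenNLP code; the class `DictionaryLineReader` and the method `readLines` are hypothetical names used only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class DictionaryLineReader {

    // Reads all lines from the stream, decoding bytes with the given charset.
    // Without the charset argument, InputStreamReader would fall back to the
    // JVM default charset (windows-1252 in the reported environment).
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers need no checked-exception handling.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Calling `readLines(stream, StandardCharsets.UTF_8)` on a UTF-8 dictionary then yields the same strings on every platform, regardless of the system encoding.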
For example, my lemma.dict file contains the following line (UTF-8):
mäuse NN maus //German word for mice
And the InputStreamReader decodes it as windows-1252:
mÃ¤use NN maus
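The effect can be reproduced outside the lemmatizer with a short, self-contained sketch. The class `MojibakeDemo` and its `decode` method are hypothetical and only illustrate the decoding step: in UTF-8, "ä" is the two bytes 0xC3 0xA4, and windows-1252 maps each of those bytes to a separate character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of OpenNLP.
class MojibakeDemo {

    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "mäuse" as it is stored in a UTF-8 dictionary file.
        byte[] utf8Bytes = "m\u00e4use".getBytes(StandardCharsets.UTF_8);

        String correct = decode(utf8Bytes, StandardCharsets.UTF_8);
        String garbled = decode(utf8Bytes, Charset.forName("windows-1252"));

        // The garbled key has six characters instead of five, so a lookup
        // for the real word cannot match it.
        System.out.println(correct.length() + " vs " + garbled.length());
        System.out.println(correct.equals(garbled));
    }
}
```

Because the dictionary key is stored in its garbled form, a lookup for the correctly encoded word finds no entry, which matches the behaviour described above.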