Details
Improvement
Status: Closed
Minor
Resolution: Fixed
2.3.0
None
Description
While investigating OPENNLP-1190, I tested the example from the OpenNLP documentation to convert the Esp.train example to the OpenNLP format, see: https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002
I ran: opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt
When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.
A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8. As a result, accented and other special characters of the Spanish alphabet were converted to garbage in the resulting UTF-8 encoded file, because the input bytes were interpreted with the wrong character set.
Therefore, Conll02NameSampleStream needs a fix to read the original files in ISO_8859_1.
With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file.
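The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the actual Conll02NameSampleStream code: the class name EncodingMismatchDemo, its helper methods, and the sample word are hypothetical and only illustrate what happens when ISO_8859_1 bytes are decoded as UTF-8 versus with the correct charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not exist in OpenNLP.
class EncodingMismatchDemo {

    // Decodes the raw bytes the way the converter did before the fix (UTF-8 assumed).
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the raw bytes with the charset the corpus files actually use.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Bogotá" as it would be stored in an ISO_8859_1 encoded file:
        // the accented letter is the single byte 0xE1.
        byte[] raw = "Bogot\u00e1".getBytes(StandardCharsets.ISO_8859_1);

        // 0xE1 is not a complete UTF-8 sequence, so the decoder substitutes
        // the replacement character U+FFFD for it.
        String wrong = decodeAsUtf8(raw);

        // Decoding with ISO_8859_1 restores the original text.
        String right = decodeAsLatin1(raw);

        System.out.println("contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("round trip ok:   " + right.equals("Bogot\u00e1"));
    }
}
```

Once the text has been decoded with the correct charset, writing it out as UTF-8 preserves the accents, which matches the behaviour observed after the fix.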