[OPENNLP-1512] Fix incorrect encoding used in Conll02NameSampleStream - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.1
Component/s: Formats, Name Finder
Labels: None

Description

While investigating OPENNLP-1190, I tested the example from the OpenNLP documentation to convert the Esp.train example to the OpenNLP format, see: https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002

I ran: opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt

When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.

A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8. Because the input bytes are decoded with the wrong character set, accented letters and other special characters of the Spanish alphabet are turned into garbage in the resulting UTF-8 encoded file.
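The effect can be reproduced in isolation. The following is a minimal sketch, not code from Conll02NameSampleStream: the class name `EncodingMismatchDemo`, the helper `decode`, and the sample string are hypothetical and chosen only for illustration. It encodes a Spanish name as ISO_8859_1 bytes and then decodes those bytes once with UTF-8 (the wrong assumption) and once with ISO_8859_1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of OpenNLP.
public class EncodingMismatchDemo {

    // Decodes raw bytes with the given charset, as a reader would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "José Martín", written with Unicode escapes so the source file's
        // own encoding does not affect the demonstration.
        String original = "Jos\u00e9 Mart\u00edn";

        // Bytes as they would appear in an ISO_8859_1 encoded corpus file.
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Wrong assumption: the single bytes 0xE9 and 0xED are not valid
        // UTF-8 sequences, so they decode to U+FFFD replacement characters.
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Correct assumption: the accented letters survive the round trip.
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode matches original:      " + wrong.equals(original));
        System.out.println("ISO_8859_1 decode matches original: " + right.equals(original));
    }
}
```

Writing the wrongly decoded string back out as UTF-8 does not repair it, since the original byte values are already lost at decoding time.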

Therefore, Conll02NameSampleStream needs a fix to read the original files in ISO_8859_1.

With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file.
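The general shape of such a fix, reading with an explicit ISO_8859_1 charset and writing UTF-8, might look like the sketch below. This is an assumption-laden illustration using only standard JDK calls; it is not the actual patch applied to Conll02NameSampleStream, and the class name `Latin1ToUtf8` and method `convert` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; it is not part of OpenNLP.
public class Latin1ToUtf8 {

    // Copies a text file line by line, decoding the input as ISO_8859_1
    // and encoding the output as UTF-8.
    static void convert(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.ISO_8859_1);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Latin1ToUtf8 <input file> <output file>
        convert(Path.of(args[0]), Path.of(args[1]));
    }
}
```

Stating the charset explicitly on both the reader and the writer avoids depending on the platform default, which is the kind of implicit assumption that caused the garbled output described above.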

People

Assignee: Martin Wiesner

Reporter: Martin Wiesner

Votes: 0

Watchers: 1

Dates

Created: 01/Sep/23 15:46

Updated: 04/Sep/23 06:53

Resolved: 04/Sep/23 06:53