  1. OpenNLP
  2. OPENNLP-1512

Fix incorrect encoding used in Conll02NameSampleStream


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1
    • Component/s: Formats, Name Finder
    • Labels: None

    Description

      While investigating OPENNLP-1190, I tested the example from the OpenNLP documentation to convert the Esp.train example to the OpenNLP format, see: https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002

      I ran: opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt

      When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.

      A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8. As a result, accented and other special characters of the Spanish alphabet are converted to garbage in the resulting UTF-8 encoded file, because the input bytes are interpreted with the wrong character set.
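
      The mismatch can be reproduced outside OpenNLP. The following is a minimal sketch, not code from Conll02NameSampleStream: the class name, the helper method and the sample name are hypothetical and only illustrate what happens when ISO_8859_1 bytes are decoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of OpenNLP.
class EncodingMismatchDemo {

    // Decodes raw bytes with the given charset. The String constructor
    // substitutes U+FFFD for byte sequences that are malformed in that charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "Jos\u00e9 Mart\u00ednez";

        // Simulate the on-disk bytes of an ISO_8859_1 encoded corpus line.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Wrong assumption: the accented letters are not valid UTF-8 sequences here.
        String wrong = decode(latin1Bytes, StandardCharsets.UTF_8);

        // Matching charset: the text survives unchanged.
        String right = decode(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("decoded as ISO_8859_1 equals original: " + right.equals(original));
    }
}
```

      Once the replacement character has been substituted, the original letter cannot be recovered from the converted file, so the correction has to happen when the input is read.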

      Therefore, Conll02NameSampleStream needs a fix to read the original files in ISO_8859_1.

      With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file. 
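
      The idea behind the fix can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: the class name, the transcode method and the two sample lines are hypothetical, and in-memory streams stand in for esp.train and the output file. The input is read with an explicit ISO_8859_1 charset and written back as UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; it is not part of OpenNLP.
class Latin1ToUtf8 {

    // Reads lines as ISO_8859_1 and writes them as UTF-8.
    // Closes both streams when done.
    static void transcode(InputStream in, OutputStream out) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.ISO_8859_1));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample lines in a word/tag layout, encoded as ISO_8859_1.
        byte[] latin1 = "Mart\u00edn B-PER\nJos\u00e9 I-PER\n"
                .getBytes(StandardCharsets.ISO_8859_1);

        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), utf8);

        // The accents are intact once the UTF-8 output is decoded as UTF-8.
        System.out.print(new String(utf8.toByteArray(), StandardCharsets.UTF_8));
    }
}
```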

          People

            Assignee: Martin Wiesner (mawiesne)
            Reporter: Martin Wiesner (mawiesne)