  1. OpenNLP
  2. OPENNLP-1512

Fix incorrect encoding used in Conll02NameSampleStream


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1
    • Component/s: Formats, Name Finder
    • Labels: None

    Description

      While investigating OPENNLP-1190, I tested the example from the OpenNLP documentation that converts the esp.train file to the OpenNLP format, see: https://opennlp.apache.org/docs/2.3.0/manual/opennlp.html#tools.corpora.conll.2002

      I ran: opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt

      When I checked the output corpus (txt) file, I noticed incorrect symbols being written there.

      A quick debugging session revealed that the original files were ISO_8859_1 encoded. However, line 94 of Conll02NameSampleStream assumed UTF-8 encoding. As a result, accented characters and other special symbols of the Spanish alphabet are turned into garbage in the resulting UTF-8 encoded file, because the input bytes are decoded with a different character set than the one they were written in.

      Therefore, Conll02NameSampleStream needs a fix to read the original files in ISO_8859_1.

      With this measure in place, the accents á, é, ... are correctly written to the resulting converted training corpus file. 
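
      The effect described above can be reproduced outside of OpenNLP. The following is a minimal sketch, not the actual Conll02NameSampleStream code: the class name EncodingDemo, the helper readFirstLine, and the sample line are hypothetical and only illustrate what happens when ISO_8859_1 bytes are decoded as UTF-8 versus with the matching charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or mirror any OpenNLP API.
public class EncodingDemo {

    // Reads the first line of the given bytes, decoding them with the given charset.
    static String readFirstLine(byte[] data, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample line in a "token tag" layout, stored as ISO_8859_1 bytes.
        // \u00e1 is the letter a with an acute accent.
        byte[] latin1Bytes = "M\u00e1laga B-LOC".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong: the single byte 0xE1 is not a valid UTF-8 sequence here,
        // so the decoder substitutes the replacement character U+FFFD.
        String wrong = readFirstLine(latin1Bytes, StandardCharsets.UTF_8);

        // Right: decoding with the charset the bytes were written in.
        String right = readFirstLine(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("Decoded as UTF-8:      " + wrong);
        System.out.println("Decoded as ISO_8859_1: " + right);
    }
}
```

      Passing the charset explicitly to the reader, rather than relying on an assumed or platform default, keeps the decoding step consistent with how the corpus files were written.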

          People

            Assignee: Martin Wiesner (mawiesne)
            Reporter: Martin Wiesner (mawiesne)