OpenNLP / OPENNLP-862

BRAT format packages do not handle punctuation correctly when training NER model


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Formats
    • Labels: None

    Description

      BRAT does not require preprocessing of text files in order to add annotations to text documents. This is great because I can feed documents from the corpora I am given directly into BRAT. If I have a line such as:

      Residence: Athens, Georgia

      I would create two annotations in BRAT, Athens and Georgia; BRAT would generate the offsets and everything would be fine.

      With TokenNameFinderTrainer.brat, however, it appears that only one entity (Georgia) is processed correctly and the other is dropped, because the comma is not separated from Athens. I have 789 annotated raw, non-preprocessed text documents from past efforts, and I believe OpenNLP should be able to handle lines like the one above in the BRAT format code.

      It appears that BratNameSampleStream uses the WhitespaceTokenizer, which is what produces "Athens," (with the trailing comma) as a single token. In my limited testing on raw documents, the SimpleTokenizer performs better with BRAT, if the current general approach is kept.
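
      The difference between the two tokenizers on the example line can be sketched with OpenNLP's tokenizer API (a minimal illustration, assuming opennlp-tools is on the classpath; the class name is made up for the demo):

```java
import java.util.Arrays;

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class BratTokenizationDemo {

    public static void main(String[] args) {
        String line = "Residence: Athens, Georgia";

        // WhitespaceTokenizer splits on whitespace only, so the comma
        // stays attached to "Athens" and that token can never line up
        // with a BRAT annotation span covering just "Athens".
        String[] ws = WhitespaceTokenizer.INSTANCE.tokenize(line);
        System.out.println(Arrays.toString(ws));
        // [Residence:, Athens,, Georgia]

        // SimpleTokenizer splits on character-class changes, so the
        // comma becomes its own token and "Athens" aligns with the span.
        String[] simple = SimpleTokenizer.INSTANCE.tokenize(line);
        System.out.println(Arrays.toString(simple));
        // [Residence, :, Athens, ,, Georgia]
    }
}
```

      This is only a sketch of the mismatch, not a proposed fix; it shows why the WhitespaceTokenizer-produced token fails to match the annotation offsets that BRAT records for the bare word.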

      Attachments

        Activity


          People

            joern Jörn Kottmann
            gwerner Gregory Werner

