The current implementation of the cTAKES PTB tokenizer outputs newline tokens, but the OpenNLP tokenizers don't support this yet.
There are two ways of supporting this:
- Output only the regular tokens and add the newline tokens in a second pass, e.g. with a UIMA AE
- Extend the OpenNLP tokenizer to support layout tags (e.g. <NEWLINE>, or a span with NEWLINE as its type)
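
A rough sketch of the first option (second pass), to make the idea concrete. It is not existing cTAKES or OpenNLP code: the `TokenSpan` class, the `NewlineTokenInserter` name, and the "NEWLINE" / "TOKEN" type strings are all hypothetical, with `TokenSpan` standing in for whatever typed span the tokenizer produces. It assumes the tokens come sorted by offset and do not overlap, and it scans the gaps between them for line breaks, treating "\r\n" as a single newline token.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineTokenInserter {

    // Hypothetical type labels; the real names would be up to the implementation.
    public static final String NEWLINE_TYPE = "NEWLINE";
    public static final String TOKEN_TYPE = "TOKEN";

    /** Hypothetical stand-in for a typed token span, half-open: [start, end). */
    public static final class TokenSpan {
        public final int start;
        public final int end;
        public final String type;

        public TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Second pass: returns the given tokens plus one NEWLINE span per line
     * break found between tokens, in document order.
     * Assumes tokens are sorted by start offset and do not overlap.
     */
    public static List<TokenSpan> addNewlineTokens(String text, List<TokenSpan> tokens) {
        List<TokenSpan> result = new ArrayList<>();
        int pos = 0;
        for (TokenSpan token : tokens) {
            collectNewlines(text, pos, token.start, result);
            result.add(token);
            pos = token.end;
        }
        // Line breaks after the last token.
        collectNewlines(text, pos, text.length(), result);
        return result;
    }

    /** Adds a NEWLINE span for each "\r\n", "\n" or "\r" in text[from, to). */
    private static void collectNewlines(String text, int from, int to, List<TokenSpan> out) {
        int i = from;
        while (i < to) {
            char c = text.charAt(i);
            if (c == '\r' && i + 1 < to && text.charAt(i + 1) == '\n') {
                out.add(new TokenSpan(i, i + 2, NEWLINE_TYPE));
                i += 2;
            } else if (c == '\n' || c == '\r') {
                out.add(new TokenSpan(i, i + 1, NEWLINE_TYPE));
                i++;
            } else {
                i++;
            }
        }
    }
}
```

In a UIMA AE the same loop would presumably run over the token annotations of the CAS instead of a plain list. The second option would avoid the extra pass, at the cost of changing the tokenizer itself.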