Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Tokenizer
    • Labels:

      Description

      Add a tokenizer based on Penn Treebank (PTB) rules.

        Issue Links

          Activity

          joern (Joern Kottmann) added a comment (edited)

          The current implementation of the cTAKES PTB tokenizer outputs newline tokens, but the OpenNLP tokenizers don't support this yet.

          There are two ways of supporting this:

          • Output only the regular tokens, without newline tokens, and add the newline tokens in a second pass, e.g. with a UIMA AE
          • Extend the OpenNLP tokenizer to support layout tags (e.g. <NEWLINE>, or a span with this as its type)
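
The first option (a second pass that adds newline tokens) could look roughly like the sketch below. It is only an illustration in Python and does not use OpenNLP, cTAKES, or UIMA code: the `Span` tuple, the whitespace-based `tokenize` stand-in, and the `add_newline_spans` helper are all hypothetical names chosen for this example. Giving each span a type also hints at the second option, where a span typed `NEWLINE` acts as the layout tag.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """Character offsets into the text plus a type label (hypothetical)."""
    start: int
    end: int
    type: str = "TOKEN"


def tokenize(text: str) -> List[Span]:
    """Stand-in tokenizer: whitespace-delimited spans, no newline tokens."""
    return [Span(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def add_newline_spans(text: str, tokens: List[Span]) -> List[Span]:
    """Second pass: merge one NEWLINE span per line break into the tokens."""
    newlines = [
        Span(m.start(), m.end(), "NEWLINE")
        for m in re.finditer(r"\r\n|\r|\n", text)
    ]
    # Tokens never contain line breaks, so sorting by offset interleaves
    # the two lists without overlaps.
    return sorted(list(tokens) + newlines)
```

Because the second pass only reads the original text and the token offsets, the tokenizer itself stays unchanged, which is the main appeal of the first option.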

            People

            • Assignee:
              Unassigned
              Reporter:
              chenpei Pei Chen
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated: