cTAKES / CTAKES-372

Penn TreeBank Tokenizer could use some attention

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: future enhancement
    • Fix Version/s: None
    • Component/s: ctakes-core

      Description

      The Penn TreeBank (PTB) tokenizer currently used by cTAKES has some inconsistencies; see https://issues.apache.org/jira/browse/CTAKES-371. It also does not seem to incorporate some of the clinical rules set out in http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf.

      Some major refactoring is also in order, as are numerous test cases.

            People

            • Assignee: James Joseph Masanz (james-masanz)
            • Reporter: Sean Finan (seanfinan)
            • Votes: 0
            • Watchers: 2
