  1. cTAKES
  2. CTAKES-266

tokenizer creates empty tokens before contractions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.1
    • Component/s: ctakes-core
    • Labels:
      None

      Description

      Normally contractions are tokenized as follows:

      don't = do + n't

      The code in ContractionsPTB creates a WordToken for the "do" and a ContractionToken for the "n't". (There is some special logic for n't.) There are some odd cases where n't has no preceding text. In my case it was some non-clinical text ("surf n'turf"), but you can imagine typos as well ("do n't"). In these cases the preceding text is empty, since the n't sits at the start of the token, and the code creates an empty WordToken, which can break downstream components (I noticed it in the parser). This can be fixed easily by checking for a token length of 0 before creating the word token.
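
A minimal sketch of the proposed length check, for illustration only. This is not the actual ContractionsPTB code: the class ContractionSplitSketch, the Piece type, and the split method are hypothetical names, and the sketch assumes the simplified case of a token that ends in n't (such as the typo "do n't", where n't arrives as its own token).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the splitting logic; not the real ContractionsPTB class.
class ContractionSplitSketch {

    /** Minimal token stand-in: a kind ("WordToken" or "ContractionToken") and its text. */
    static final class Piece {
        final String kind;
        final String text;

        Piece(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")";
        }
    }

    private static final String NT = "n't";

    /** Splits a token ending in n't into a word part and a contraction part. */
    static List<Piece> split(String token) {
        List<Piece> pieces = new ArrayList<>();
        if (token.isEmpty()) {
            // Nothing to tokenize, so emit nothing.
            return pieces;
        }
        int cut = token.length() - NT.length();
        // regionMatches returns false when cut is negative (token shorter than n't).
        if (!token.regionMatches(true, cut, NT, 0, NT.length())) {
            // Not an n't contraction: keep the token whole.
            pieces.add(new Piece("WordToken", token));
            return pieces;
        }
        String preceding = token.substring(0, cut);
        // The fix: only create the word token when there is preceding text.
        if (preceding.length() > 0) {
            pieces.add(new Piece("WordToken", preceding));
        }
        pieces.add(new Piece("ContractionToken", token.substring(cut)));
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(split("don't")); // [WordToken(do), ContractionToken(n't)]
        System.out.println(split("n't"));   // [ContractionToken(n't)]
    }
}
```

Without the length check, the second call would also emit a WordToken with empty text, which is the kind of token that tripped up the parser downstream.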

            People

            • Assignee:
              Tim Miller (tmill)
              Reporter:
              Tim Miller (tmill)
            • Votes:
              0
              Watchers:
              2
