  1. cTAKES
  2. CTAKES-254

Apostrophe in contraction breaks TokenizerPTB

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.1
    • Component/s: ctakes-core
    • Labels:
      None

      Description

      Sample text: "on n'tion"
      A single character followed by an apostrophe breaks TokenizerPTB.
      This results in an OutOfBoundsException at:
      org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB.setNumPosition(TokenizerPTB.java:1147)
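
      The failure mode can be reproduced outside cTAKES. The sketch below assumes, as the patch comment further down suggests, that setNumPosition receives an empty token and calls charAt(0) on it. EmptyTokenRepro, startsWithDigit and failsOn are hypothetical names used only for illustration and are not part of cTAKES.

```java
public class EmptyTokenRepro {

    // Hypothetical stand-in for an unguarded digit check on a token's first character.
    static boolean startsWithDigit(String tokenText) {
        return Character.isDigit(tokenText.charAt(0)); // throws when tokenText is ""
    }

    // Returns true if the unguarded check throws for the given token text.
    static boolean failsOn(String tokenText) {
        try {
            startsWithDigit(tokenText);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOn("n")); // non-empty token: no exception
        System.out.println(failsOn(""));  // empty token: charAt(0) throws
    }
}
```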

      Sean Finan already had a patch for this some time ago, but I wanted to check whether we missed something else here.
      The patch below adds a check for an empty string in the token,
      starting at line 1145:

      // START

      private void setNumPosition(WordToken wta, String tokenText) {
          if ( tokenText.isEmpty() ) {
              // was getting ioobE from tokenText.charAt(..)
              // Possibilities like this (empty, null) should always be checked
              // - but I wonder that we get (want) empty tokens at all.
              // I believe that working with zero-length words is a bug, and this is not a fix it merely avoids a crash.
              wta.setNumPosition( TokenizerAnnotator.TOKEN_NUM_POS_NONE );
              return;
          }

          if (isDigit(tokenText.charAt(0))) {

      // END
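
      The guard can also be exercised in isolation. This is a minimal sketch, not the actual cTAKES code: NumPositionGuard, numPosition and the two constants are hypothetical stand-ins for setNumPosition and the TokenizerAnnotator.TOKEN_NUM_POS_* values, and the null check is an addition prompted by the patch comment ("empty, null").

```java
public class NumPositionGuard {

    // Hypothetical constants standing in for TokenizerAnnotator.TOKEN_NUM_POS_* values.
    static final int NUM_POS_NONE = 0;
    static final int NUM_POS_FIRST = 1;

    // Classifies a token without indexing into a null or empty string.
    static int numPosition(String tokenText) {
        if (tokenText == null || tokenText.isEmpty()) {
            // Avoids the crash only; it does not explain why an empty token was produced.
            return NUM_POS_NONE;
        }
        if (Character.isDigit(tokenText.charAt(0))) {
            return NUM_POS_FIRST;
        }
        return NUM_POS_NONE;
    }

    public static void main(String[] args) {
        System.out.println(numPosition(""));     // empty token handled without an exception
        System.out.println(numPosition("3rd"));  // token starting with a digit
        System.out.println(numPosition("tion")); // token starting with a letter
    }
}
```

      As the patch comment notes, a guard like this hides the symptom; whether the tokenizer should emit zero-length tokens at all is a separate question.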

              People

              • Assignee:
                Pei Chen (chenpei)
              • Reporter:
                Pei Chen (chenpei)
              • Votes:
                0
              • Watchers:
                3
