  1. cTAKES
  2. CTAKES-227

Broca's -> PunctuationToken instead of ContractionToken - caused by apostrophe seen as sentence ending

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: ctakes-core
    • Labels:
      None

      Description

      The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) sometimes treats an apostrophe as a sentence break where the ctakes-3.0.0-incubating model did not.

      The training data used for the recently rebuilt model contains only 7 lines that end with an apostrophe (single quote) followed immediately by a newline.

      It has >100K occurrences of 's.

      It has >175K occurrences of the ' character in all.
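Counts like the ones above can be gathered with a short script. The following is a minimal sketch in Python, not the tooling actually used for this issue: the function name `apostrophe_stats` and the sample text are hypothetical, and it assumes the training data is plain text using the ASCII apostrophe.

```python
def apostrophe_stats(text):
    """Count apostrophe patterns in plain-text training data (illustrative only)."""
    lines = text.splitlines()
    return {
        # Lines whose last character is an apostrophe. Note: a final line
        # without a trailing newline is counted too.
        "lines_ending_with_apostrophe": sum(1 for line in lines if line.endswith("'")),
        # Occurrences of an apostrophe directly followed by "s"
        "apostrophe_s": text.count("'s"),
        # All occurrences of the ' character
        "apostrophes_total": text.count("'"),
    }


# Hypothetical sample text, not taken from the real training data.
sample = "Lesion in Broca's area.\nHe said 'halt'\nThe patient's speech is fluent.\n"
print(apostrophe_stats(sample))
```

Comparing the first count with the other two gives a quick view of how rarely an apostrophe ends a line relative to how often it appears inside a word.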

      The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.

      The word "Broca's" used to have a ContractionToken, but because a sentence now ends on the apostrophe, the apostrophe is annotated as a PunctuationToken instead.
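A quick way to look for this kind of split in other notes is to flag sentence boundaries that land between an apostrophe and a following "s". The sketch below is a standalone heuristic in Python, not part of cTAKES: the function `suspicious_breaks`, the exclusive end-offset input format, and the sample note are all assumptions made for illustration.

```python
import re

# A letter, an apostrophe, then "s" at a word boundary: the shape of "Broca's".
_POSSESSIVE = re.compile(r"[A-Za-z]'s\b")


def suspicious_breaks(text, sentence_ends):
    """Return sentence end offsets that fall right after the apostrophe of a 's suffix.

    sentence_ends holds exclusive end offsets from some sentence detector
    (a hypothetical input format chosen for this sketch).
    """
    flagged = []
    for end in sentence_ends:
        # If the sentence ends right after the apostrophe, the apostrophe is at
        # end - 1 and the letter before it at end - 2, where the match must start.
        if end >= 2 and _POSSESSIVE.match(text, end - 2):
            flagged.append(end)
    return flagged


# Hypothetical note text with one bad boundary right after the apostrophe.
note = "Lesion in Broca's area. Speech is halting."
bad_end = note.index("'") + 1
print(suspicious_breaks(note, [bad_end, len(note)]))
```

Only the boundary directly after the apostrophe is flagged. A boundary after a closing single quote that is not followed by "s" is left alone.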

      See more in the thread started at
      http://markmail.org/message/wavipejszlspzo5u
      including examples that split correctly and incorrectly.

            People

            • Assignee: James Joseph Masanz (james-masanz)
            • Reporter: James Joseph Masanz (james-masanz)
            • Votes: 0
            • Watchers: 2

              Dates

              • Created:
                Updated:
                Resolved: