[CTAKES-227] Broca's -> PunctuationToken instead of ContractionToken - caused by apostrophe seen as sentence ending - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 3.1.0
Fix Version/s: 3.1.0
Component/s: ctakes-core
Labels: None

Description

The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) sometimes treats an apostrophe as a sentence break where the ctakes-3.0.0-incubating model didn’t.

The training data used for the recently rebuilt model contains only 7 lines that end with an apostrophe (single quote) followed immediately by a newline.

It has >100K occurrences of 's

It has >175K occurrences of the ' character in all.
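
For anyone who wants to reproduce counts like these, here is a minimal sketch of how they could be gathered. It is not the script that produced the numbers above; the function name `apostrophe_stats` and the assumption that the training data is plain text with one sentence per line are mine.

```python
def apostrophe_stats(lines):
    """Count apostrophe patterns in an iterable of text lines.

    Assumption: each element of `lines` is one line of training data,
    with or without its trailing newline.
    """
    ends_with_apostrophe = 0  # lines whose last character is '
    possessive_s = 0          # occurrences of the two-character sequence 's
    total_apostrophes = 0     # every ' character

    for line in lines:
        text = line.rstrip("\r\n")  # drop only the line terminator
        if text.endswith("'"):
            ends_with_apostrophe += 1
        possessive_s += text.count("'s")
        total_apostrophes += text.count("'")

    return {
        "ends_with_apostrophe": ends_with_apostrophe,
        "possessive_s": possessive_s,
        "total_apostrophes": total_apostrophes,
    }


if __name__ == "__main__":
    sample = [
        "Broca's area is intact.\n",
        "Patient said 'no'\n",
        "The patient's mother's history\n",
    ]
    print(apostrophe_stats(sample))
```

A file can be passed directly, e.g. `apostrophe_stats(open(path, encoding="utf-8"))`, where `path` is whatever file holds the training sentences.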

The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.

The word "Broca's" used to have a ContractionToken but since a sentence is now ending on the apostrophe, the apostrophe is getting annotated as a PunctuationToken.
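
To illustrate why the sentence boundary changes the token type, here is a toy sketch. It is not the cTAKES tokenizer and does not use any cTAKES API; the regular expression, the function names, and the labels `word`, `contraction`, and `punct` are hypothetical stand-ins. It only assumes that tokens are produced one sentence at a time, so a `'s` can be recognized as a contraction only when the apostrophe and the `s` fall in the same sentence.

```python
import re

# Toy token pattern (an assumption for illustration, not cTAKES rules):
#   contraction: 's directly after a word character
#   word:        a run of word characters
#   punct:       any single non-word, non-space character
TOKEN_RE = re.compile(
    r"(?P<contraction>(?<=\w)'s\b)|(?P<word>\w+)|(?P<punct>[^\w\s])"
)


def toy_tokenize(sentence):
    """Return (label, text) pairs for one sentence."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(sentence)]


def tokenize_sentences(sentences):
    """Tokenize each sentence independently, as a per-sentence tokenizer would."""
    return [toy_tokenize(s) for s in sentences]


if __name__ == "__main__":
    # Sentence kept whole: the 's is seen together and labeled a contraction.
    print(tokenize_sentences(["Broca's area"]))
    # Sentence break placed right after the apostrophe: the apostrophe
    # is now a lone punctuation mark and the s starts the next sentence.
    print(tokenize_sentences(["Broca'", "s area"]))
```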

See more in the thread started at
http://markmail.org/message/wavipejszlspzo5u
including examples that split correctly and incorrectly.

People

Assignee: James Joseph Masanz
Reporter: James Joseph Masanz
Votes: 0
Watchers: 2

Dates

Created: 26/Aug/13 18:21
Updated: 17/Sep/13 09:07
Resolved: 17/Sep/13 09:07