[CTAKES-520] SentenceDetectorAnnotatorBIO token scanning performance issues - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 4.0.0
Fix Version/s: None
Component/s: ctakes-core
Labels: None

Description

SentenceDetectorAnnotatorBIO iterates over every character in the Segment and classifies it as Begin, Inside, or Outside a Sentence. When doing this, it needs to know the next and previous token from the current character.

It currently finds these tokens afresh for each character: starting from the current character, it scans forwards and backwards for whitespace until it finds the boundaries of the tokens on either side of the current position. This is very wasteful. While the current index steps through a word, the tokens do not change, because we are still within the same word. Also, since we only ever move forward through the text, we never need to scan for the previous token, because we already know it.

(I found this bug with a pathological case where I had a "document" with a single word that was a megabyte long. In a case where the word length is not bounded, the current algorithm is quadratic instead of linear, because it scans the length of the word for each character.)

Patch attached. This fixes the problem by keeping track of the word boundary, and only scanning for the next token when we have reached the boundary of the current one. Also, the previous token is simply taken as the token from the previous iteration, and the token features are only recomputed when the token changes.
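
For illustration only, here is a minimal Java sketch of the boundary-tracking idea described above. It is not the attached patch and not the cTAKES code: the class and method names (TokenScanSketch, nextToken, naiveContext, incrementalContexts) are hypothetical, and it assumes tokens are simply whitespace-delimited runs as tested by Character.isWhitespace. The naive lookup re-scans around every index, while the incremental version only scans when the index passes the end of the current token and reuses the previous iteration's token as the previous token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the cTAKES implementation or the attached patch. */
final class TokenScanSketch {

    /** Returns {start, end} of the first whitespace-delimited token at or after 'from'. */
    static int[] nextToken(String text, int from) {
        int n = text.length();
        int start = from;
        while (start < n && Character.isWhitespace(text.charAt(start))) start++;
        int end = start;
        while (end < n && !Character.isWhitespace(text.charAt(end))) end++;
        return new int[] {start, end};
    }

    /** Naive lookup: re-scans around index i, so it costs O(token length) per character. */
    static String[] naiveContext(String text, int i) {
        int start = i;
        if (!Character.isWhitespace(text.charAt(i))) {
            // Inside a word: scan backwards to where the word begins.
            while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) start--;
        }
        int[] cur = nextToken(text, start);
        // Scan backwards over whitespace, then over the previous word.
        int prevEnd = cur[0];
        while (prevEnd > 0 && Character.isWhitespace(text.charAt(prevEnd - 1))) prevEnd--;
        int prevStart = prevEnd;
        while (prevStart > 0 && !Character.isWhitespace(text.charAt(prevStart - 1))) prevStart--;
        return new String[] {text.substring(prevStart, prevEnd), text.substring(cur[0], cur[1])};
    }

    /** Incremental lookup: scans forward only when i passes the current token's end. */
    static List<String[]> incrementalContexts(String text) {
        List<String[]> out = new ArrayList<>(text.length());
        int[] tok = nextToken(text, 0);
        String prev = "";
        String cur = text.substring(tok[0], tok[1]);
        String[] context = {prev, cur};
        for (int i = 0; i < text.length(); i++) {
            if (i >= tok[1]) {
                // Crossed the boundary: the old current token becomes the previous one.
                prev = cur;
                tok = nextToken(text, i);
                cur = text.substring(tok[0], tok[1]);
                context = new String[] {prev, cur};
            }
            // Within a token the same context object is reused, so nothing is recomputed.
            out.add(context);
        }
        return out;
    }
}
```

In this sketch, each character is examined a bounded number of times overall, so the work stays linear in the text length even for a single very long word. Any per-token features could be computed at the same point where the context array is rebuilt.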

Attachments

CTAKES-520.patch (5 kB), attached by Ewan Mellor, 16/Aug/18 22:29

People

Assignee: Tim Miller

Reporter: Ewan Mellor

Votes: 0

Watchers: 1

Dates

Created: 16/Aug/18 21:18

Updated: 20/Dec/18 18:01