Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Labels: future enhancement
Description
Currently, cTAKES cannot easily be deployed in an asynchronous manner. UIMA components are not serializable, and therefore neither are cTAKES components. We would like better ways to run cTAKES in a distributed fashion.
For example, a long document (e.g. 10+ pages) takes cTAKES a long time to process.
I would like a feature that partitions the input to cTAKES, in a way that does not affect cTAKES annotation performance, so that documents can be processed on a cluster running in distributed mode (e.g. cTAKES on Spark Streaming), and the results then recombined such that word/phrase token positions are sequentially ordered.
We have a simple implementation of the ClinicalPipelineFactory with Spark Streaming. Our initial attempt partitions the input by paragraphs. For example, we are doing something like:
rdd.map(paragraph => process_in_ctakes(paragraph))
I also wanted to see if there are any better ways of doing this.
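To make the partition-and-recombine idea concrete, here is a minimal sketch in Python. It is not cTAKES' actual API: `split_paragraphs` and `annotate` are hypothetical stand-ins (the real per-partition work would invoke a cTAKES pipeline). The key point is keeping each paragraph's base character offset alongside the text, so locally computed token offsets can be shifted back into globally, sequentially ordered positions.

```python
import re

def split_paragraphs(text):
    """Split a document into (base_offset, paragraph) pairs on blank lines."""
    return [(m.start(), m.group())
            for m in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text)]

def annotate(paragraph):
    """Hypothetical stand-in for a per-partition cTAKES call: returns
    word tokens with offsets local to the paragraph."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+", paragraph)]

def process_document(text):
    parts = split_paragraphs(text)
    # In Spark, this map would run on the cluster, e.g.:
    #   sc.parallelize(parts).map(lambda p: (p[0], annotate(p[1])))
    annotated = [(base, annotate(p)) for base, p in parts]
    # Recombine: shift each local offset by its paragraph's base offset
    # so token positions are global and sequentially ordered.
    return sorted((base + s, base + e, w)
                  for base, anns in annotated
                  for s, e, w in anns)

doc = "First paragraph here.\n\nSecond paragraph follows."
for start, end, word in process_document(doc):
    print(start, end, word)
```

Because blank lines are never inside a token, this particular split cannot cut an annotation in half; whether paragraph boundaries are safe for cross-sentence cTAKES components (e.g. coreference) is exactly the open question above.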
Issue Links
- is related to CTAKES-314: BigTop/Hadoop cTAKES integration (Open)