[CTAKES-105] Add Apache Tika integration - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: future enhancement
Component/s: None
Labels: None

Description

It would be useful to add a utility or pre-processor that accepts any document type (scanned PDF, image, Word, PDF, XLS, etc.), uses something like Apache Tika to detect the type automatically, runs OCR where needed, extracts the plain text, and then feeds it to the pipeline.
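
A rough sketch of what such a pre-processor could look like, only to make the request concrete. All names here (`TextExtractor`, `DocumentPreprocessor`, `PlainTextExtractor`) are hypothetical and not part of cTAKES or Tika. The extractor is kept behind an interface so a Tika-backed implementation could be plugged in later; the stand-in below only reads plain-text files so the example runs without extra dependencies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

/** Hypothetical extraction step: turns a document into plain text. */
interface TextExtractor {
    String extract(Path document) throws IOException;
}

/** Hypothetical pre-processor: extract plain text, then hand it to the pipeline. */
final class DocumentPreprocessor {
    private final TextExtractor extractor;
    private final Consumer<String> pipeline; // assumption: the pipeline accepts plain text

    DocumentPreprocessor(TextExtractor extractor, Consumer<String> pipeline) {
        this.extractor = extractor;
        this.pipeline = pipeline;
    }

    void process(Path document) throws IOException {
        String text = extractor.extract(document);
        // Fail loudly rather than feed an empty document to the pipeline.
        if (text == null || text.trim().isEmpty()) {
            throw new IOException("No text extracted from " + document);
        }
        pipeline.accept(text.trim());
    }
}

/**
 * Stand-in extractor that handles UTF-8 plain-text files only.
 * With Tika on the classpath, an implementation could instead delegate to
 * new org.apache.tika.Tika().parseToString(document.toFile()), which detects
 * the type and returns the extracted text (it also throws TikaException,
 * which would need wrapping). OCR of scanned input is not covered here.
 */
final class PlainTextExtractor implements TextExtractor {
    @Override
    public String extract(Path document) throws IOException {
        return new String(Files.readAllBytes(document), StandardCharsets.UTF_8);
    }
}
```

Usage would be along the lines of `new DocumentPreprocessor(new PlainTextExtractor(), text -> runPipeline(text)).process(path)`, where `runPipeline` is a placeholder for however the text is handed to the cTAKES pipeline.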

People

Assignee: Unassigned

Reporter: Pei Chen

Votes: 0

Watchers: 1

Dates

Created: 27/Nov/12 17:18

Updated: 27/Nov/12 17:18