[CTAKES-189] GSoC: Implement OCR/Tika to standardize text input for cTAKES - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0-incubating
Fix Version/s: future enhancement
Component/s: None
Labels:
- gsoc
- gsoc2013

Description

I propose adding a component to cTAKES that accepts various types of content (PDF, scanned JPGs, Word, XLS, TXT, etc.) and extracts the text before passing it on to cTAKES for NLP processing.
Open source libraries such as Tika and JavaOCR exist as a starting point, but I have not found a single library that easily incorporates all of the above, including OCR, into one flow.
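
One possible shape for such a component is sketched below. It is only an illustration of the idea, not an existing cTAKES API: the class `TextInputRouter`, the interface `TextExtractor`, and the choice to dispatch on file extension are all assumptions made for this example. Only plain text is handled with the standard library; PDF, Office, and image formats would be covered by registering extractors backed by Tika or an OCR library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical front end that turns an input file into plain text for cTAKES. */
public class TextInputRouter {

    /** Hypothetical extension point: one implementation per family of formats. */
    public interface TextExtractor {
        String extract(Path file) throws IOException;
    }

    // Maps a lower-cased file extension (without the dot) to its extractor.
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    public TextInputRouter() {
        // Plain text needs no extra library: read the whole file as UTF-8.
        register("txt", file -> new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }

    /** Registers (or replaces) the extractor used for the given extension. */
    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lower-cased extension of the file name, or "" if it has none. */
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Extracts text using the extractor registered for the file's extension. */
    public String extract(Path file) throws IOException {
        TextExtractor extractor = byExtension.get(extensionOf(file));
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + file);
        }
        return extractor.extract(file);
    }
}
```

With this layout, a Tika-backed extractor (for example one that wraps the `parseToString` method of Tika's `Tika` facade) could be registered for `pdf`, `doc`, and `xls`, and an OCR-backed one for `jpg`, without changing the code that hands the resulting text to cTAKES. Dispatching on content type detection rather than file extension would be more robust and is left out here for brevity.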

Attachments

Gui.java
30/Sep/13 16:14
2 kB
sandeep_hub

People

Assignee: Unassigned

Reporter: Pei Chen

Votes: 0

Watchers: 2

Dates

Created: 11/Apr/13 14:20

Updated: 30/Jun/14 18:35