[TIKA-1790] Enhancement for extracting text from pdfs - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: example, parser
Labels:
- features
- newbie

Description

This enhancement aims to extract more text from images in PDFs that have multicolored backgrounds by applying adaptive threshold binarization before running Tesseract for OCR. It also attempts to extract text from vector images inside PDFs by first rasterizing them with Ghostscript and then running Tesseract on the flattened images. The final output would be a text file containing all of the text extracted in these steps.
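To illustrate the idea, here is a minimal Python sketch; it is not the linked work-in-progress code. It assumes a local-mean variant of adaptive thresholding on a grayscale image stored as a list of rows. The function names, the `window` and `offset` parameters, and the output file pattern are hypothetical. The Ghostscript and Tesseract helpers only build command lines and do not run anything.

```python
def adaptive_threshold(gray, window=3, offset=15):
    """Binarize a grayscale image (list of rows, values 0-255).

    A pixel becomes 0 (ink) when it is darker than the mean of its
    window x window neighborhood minus `offset`, otherwise 255.
    Local-mean thresholding is an assumption of this sketch; the
    issue does not say which adaptive method is used.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the neighborhood at the image borders.
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            total = sum(gray[j][i] for j in ys for i in xs)
            mean = total / (len(ys) * len(xs))
            row.append(0 if gray[y][x] < mean - offset else 255)
        binary.append(row)
    return binary


def ghostscript_command(pdf_path, out_pattern="page-%03d.png", dpi=300):
    """Build a Ghostscript command that rasterizes each page to a PNG."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={out_pattern}", pdf_path,
    ]


def tesseract_command(image_path, out_base):
    """Build a Tesseract command; the text is written to out_base + '.txt'."""
    return ["tesseract", image_path, out_base]


# Hypothetical sample: a background that fades from light to dark,
# with one "text" pixel on the light side and one on the dark side.
background = [220 - 20 * x for x in range(8)]
sample = [list(background) for _ in range(3)]
sample[1][1] -= 60
sample[1][6] -= 60
binary = adaptive_threshold(sample)
```

On a background like this, a single global cutoff would either miss the text pixel on the light side or turn the dark side of the background black, which is why a locally computed threshold is a better fit. The command lists can be passed to `subprocess.run` once Ghostscript and Tesseract are installed.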

I would like to integrate this as a library separate from Tika, similar to how the GeoTopicParser is handled.

The code that I have is still a work in progress and can be found here.

People

Assignee: Unassigned

Reporter: Randal Moss

Votes: 0

Watchers: 1

Dates

Created: 09/Nov/15 16:41

Updated: 09/Nov/15 16:41