[TIKA-1994] Integrate OCR with PDFParser - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.14, 2.0.0
Component/s: None
Labels: None

Description

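Users can now run OCR on individual inline images embedded in PDFs, provided the PDFParser is configured to extract inline images (for example via PDFParserConfig.setExtractInlineImages(true)) and an OCR parser is available.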

There are two drawbacks: 1) with the RecursiveParserWrapper, the OCR'd text appears as an attachment rather than as part of the PDF's own content; 2) text may be extracted more cleanly from the fully rendered page than from the individual images (this is still to be determined).
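As a rough illustration of working around the first drawback, the sketch below folds the text of embedded documents back into one string after a recursive parse. It is not Tika code: each plain Map stands in for one Tika Metadata object in the list the RecursiveParserWrapper produces, the class and method names are hypothetical, and the content key "X-TIKA:content" is assumed to match the key the wrapper uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InlineOcrMerge {
    // Assumption: the key under which the wrapper stores extracted text.
    static final String CONTENT_KEY = "X-TIKA:content";

    /**
     * Joins the text of the container and of every embedded document,
     * in list order, skipping entries whose content is missing or blank.
     */
    static String mergeContent(List<Map<String, String>> metadataList) {
        StringBuilder merged = new StringBuilder();
        for (Map<String, String> metadata : metadataList) {
            String content = metadata.get(CONTENT_KEY);
            if (content == null || content.trim().isEmpty()) {
                continue; // nothing was extracted for this entry
            }
            if (merged.length() > 0) {
                merged.append('\n');
            }
            merged.append(content.trim());
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Hypothetical parse result: the PDF first, then one OCR'd inline image.
        Map<String, String> pdf = new LinkedHashMap<>();
        pdf.put(CONTENT_KEY, "Text layer of the PDF");
        Map<String, String> image = new LinkedHashMap<>();
        image.put(CONTENT_KEY, "Text recognized in the inline image");

        List<Map<String, String>> parsed = new ArrayList<>();
        parsed.add(pdf);
        parsed.add(image);

        System.out.println(mergeContent(parsed));
    }
}
```

Appending in list order keeps the container's text first, but it loses the position of each image on its page, which is one reason OCR on the rendered page may turn out to be the better strategy.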

It might be useful to run OCR against each rendered page (instead of the component images).

Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912). In the meantime, this change lets us experiment with OCR strategies until the cleaner integration is available in PDFBox 2.1.

Issue Links

is related to: PDFBOX-1912 Optical Character Recognition (OCR) (In Progress)

relates to: TIKA-1995 Improve OCR Strategy options for the PDFParser (Open)

People

Assignee: Tim Allison

Reporter: Tim Allison

Votes: 0

Watchers: 5

Dates

Created: 02/Jun/16 15:27

Updated: 12/Apr/21 12:59

Resolved: 03/Jun/16 18:53