[PDFBOX-1912] Optical Character Recognition (OCR) - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: None
Component/s: Text extraction
Labels: gsoc2014
Environment: JDK 6, C/C++

Description

Brief explanation: The PDFBox library is widely used to extract text from PDF files. However, many PDF files embed text in a malformed manner, which renders text extraction useless. There has recently been interest in extracting government data from PDF files; the PDF Liberation commons is a notable example (see https://github.com/pdfliberation for more details).

Many PDFBox users run OCR tools such as Google's Tesseract (https://code.google.com/p/tesseract-ocr/) on the final page image generated by PDFBox. We think that a more integrated OCR API in PDFBox could do a better job. PDFBox often has access to encoding and positioning information for individual glyphs, so even when the extracted text is meaningless, character-by-character or line-by-line OCR could be more accurate than OCR of the whole page image. PDFBox also has information such as image orientation, which could let it perform better OCR on pages such as embedded landscape tables.
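As a rough illustration of the orientation point, the sketch below rotates an image upright before it is handed to an OCR tool. It assumes the caller already knows the rotation as a multiple of 90 degrees; how PDFBox would expose that value is not specified in this issue, and the class and method names (`OcrPreprocessing`, `rotateQuarterTurns`) are hypothetical.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper, not part of PDFBox. */
class OcrPreprocessing {

    /**
     * Returns a copy of src rotated clockwise by quarterTurns * 90 degrees.
     * Negative values rotate counter-clockwise.
     */
    static BufferedImage rotateQuarterTurns(BufferedImage src, int quarterTurns) {
        int turns = Math.floorMod(quarterTurns, 4);
        int w = src.getWidth();
        int h = src.getHeight();
        // Odd numbers of quarter turns swap width and height.
        boolean swap = turns % 2 == 1;
        BufferedImage dst = new BufferedImage(
                swap ? h : w, swap ? w : h, BufferedImage.TYPE_INT_RGB);

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                switch (turns) {
                    case 0:  dst.setRGB(x, y, rgb); break;
                    case 1:  dst.setRGB(h - 1 - y, x, rgb); break;         // 90 degrees clockwise
                    case 2:  dst.setRGB(w - 1 - x, h - 1 - y, rgb); break; // 180 degrees
                    default: dst.setRGB(y, w - 1 - x, rgb); break;         // 270 degrees clockwise
                }
            }
        }
        return dst;
    }
}
```

A landscape table stored sideways could then be passed through `rotateQuarterTurns(image, 1)` or `rotateQuarterTurns(image, 3)`, depending on its direction, before OCR.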

There are existing JNI bindings for Tesseract available at https://code.google.com/p/tesseract-android-tools/

Expected results: Extend PDFBox with an API that allows external OCR tools to be plugged in, and implement a Tesseract plug-in using either JNI or the command line via Runtime.exec.
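One possible shape for such a plug-in API is sketched below. This is a minimal illustration, not a proposed design: the `OcrEngine` interface and the `TesseractCliEngine` class are hypothetical names. The code assumes Java 11 or later (newer than the JDK 6 listed under Environment) and a `tesseract` executable that accepts the usual `tesseract <image> <outputbase> -l <lang>` invocation and writes its result to `<outputbase>.txt`.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import javax.imageio.ImageIO;

/** Hypothetical plug-in point: any tool that turns an image into text. */
interface OcrEngine {
    String recognize(BufferedImage image) throws IOException, InterruptedException;
}

/** Sketch of a Tesseract plug-in that shells out to the command-line tool. */
class TesseractCliEngine implements OcrEngine {
    private final String executable;
    private final String language;

    TesseractCliEngine(String executable, String language) {
        this.executable = executable;
        this.language = language;
    }

    /** Builds the command line; Tesseract appends ".txt" to outputBase itself. */
    static List<String> buildCommand(String executable, Path input,
                                     Path outputBase, String language) {
        return List.of(executable, input.toString(), outputBase.toString(),
                       "-l", language);
    }

    @Override
    public String recognize(BufferedImage image)
            throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path input = dir.resolve("page.png");
        Path outputBase = dir.resolve("page");
        Path result = dir.resolve("page.txt");
        try {
            ImageIO.write(image, "png", input.toFile());
            Process process = new ProcessBuilder(
                    buildCommand(executable, input, outputBase, language))
                    .redirectErrorStream(true)
                    // Discard console output so the child cannot block on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return Files.readString(result, StandardCharsets.UTF_8);
        } finally {
            // Remove the temporary files whether or not OCR succeeded.
            Files.deleteIfExists(input);
            Files.deleteIfExists(result);
            Files.deleteIfExists(dir);
        }
    }
}
```

Keeping the interface to a single method means a JNI-based engine could be substituted for the command-line one without changing callers, for example `OcrEngine engine = new TesseractCliEngine("tesseract", "eng");`.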

Knowledge Prerequisite: Java, JNI (C/C++)

Mentor: John Hewson

PMC Note: Tesseract is under the Apache License 2.0

To learn more about PDFBox, please visit http://pdfbox.apache.org/

Issue Links

relates to

TIKA-1994 Integrate OCR with PDFParser

Resolved

People

Assignee: John Hewson

Reporter: John Hewson

Votes: 3

Watchers: 10

Dates

Created: 11/Feb/14 20:50

Updated: 31/Oct/20 14:49