[TIKA-2174] Too few formats in support declared by TesseractOCRParser - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.14
Fix Version/s: 1.15, 2.0.0
Component/s: parser
Labels: None

Description

A complete installation of Tesseract with Leptonica supports image formats that TesseractOCRParser does not declare. These include JP2, JPX and PPM.

Tesseract produces OCR output fine for JPX images as of this version:

  $ tesseract -v
  tesseract 3.04.01
   leptonica-1.73
    libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5

However, these types are not declared by getSupportedTypes, so the parser is never invoked for them and no OCR output is produced for, for example, PDFs that contain JPX images of scanned documents.
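
The gap described above can be illustrated with a minimal, self-contained sketch. It does not use Tika's classes: the class name `OcrTypeSketch`, the baseline list, and the plain-string MIME types are assumptions made for illustration and are not taken from the TesseractOCRParser source. The point is only that a type missing from the declared set is never handed to the OCR step, however capable the underlying Tesseract build is.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a parser's declared-type set; not Tika code.
class OcrTypeSketch {

    // Assumed baseline of commonly declared image types (illustrative only).
    static final Set<String> BASE_TYPES = new LinkedHashSet<>(Arrays.asList(
            "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif"));

    // Types named in this issue: JP2, JPX and PPM.
    static final Set<String> EXTRA_TYPES = new LinkedHashSet<>(Arrays.asList(
            "image/jp2", "image/jpx", "image/x-portable-pixmap"));

    // Union of the baseline and the extra types, returned read-only.
    static Set<String> supportedTypes() {
        Set<String> all = new LinkedHashSet<>(BASE_TYPES);
        all.addAll(EXTRA_TYPES);
        return Collections.unmodifiableSet(all);
    }

    // A type is dispatched to OCR only if it is in the declared set.
    static boolean isSupported(String mimeType) {
        if (mimeType == null) {
            return false;
        }
        return supportedTypes().contains(mimeType.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String type : new String[] {"image/jpx", "image/jp2", "application/pdf"}) {
            System.out.println(type + " declared: " + isSupported(type));
        }
    }
}
```

In Tika itself the declared types are `MediaType` values returned from the parser's `getSupportedTypes(ParseContext)` method rather than strings, so a real change would add the corresponding `MediaType` entries there.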

People

Assignee: Unassigned

Reporter: Matthew Caruana Galizia

Votes: 0

Watchers: 3

Dates

Created: 09/Nov/16 11:59

Updated: 12/Apr/21 13:01

Resolved: 09/Nov/16 18:57