Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
TIKA-2357 added support for additional page segmentation modes (PSM) for Tesseract OCR, including mode 0, which is orientation and script detection (OSD) only. In this mode Tesseract does not perform OCR; it only outputs orientation and script information.
An example usage of mode 0:
$ tesseract infile.png outfile --psm 0 -l osd
In this mode, the usual outfile.txt is not created. Instead, as in other modes that run OSD in addition to extraction, the result is written to an outfile.osd file, like so:
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 212
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.73
Script: Latin
Script confidence: 4.78
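For illustration, the "Key: value" lines in this output are straightforward to consume programmatically. The following is a minimal Python sketch, not code from Tika or Tesseract; the function name `parse_osd` is hypothetical, and it assumes every relevant line has the "Key: value" shape shown above.

```python
def parse_osd(text):
    """Collect 'Key: value' lines from Tesseract OSD output into a dict.

    Hypothetical helper for illustration only; values are kept as strings.
    """
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # Lines without a colon (e.g. "Page 1" or resolution warnings)
            # carry no key/value pair and are skipped.
            continue
        result[key.strip()] = value.strip()
    return result


sample = """Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.73
Script: Latin
Script confidence: 4.78
"""

osd = parse_osd(sample)
print(osd["Script"], osd["Orientation in degrees"])
```

Keeping the values as strings avoids guessing at types; a caller could convert the confidence fields to floats if needed.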
However, TesseractOCRParser#parse(...) is coded to only read the contents of outfile.txt (alternatively outfile.hocr) in all modes, so mode 0 outputs nothing regardless of input.
This is consistent with Tika's goal of outputting extracted text, but it defeats the intention of a user who selects mode 0 and expects OSD output.
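One way to picture the gap is as a choice of which output file to read for a given mode. The sketch below is an assumption-laden illustration in Python, not the change actually made in TesseractOCRParser; the function name `expected_output_file` and its parameters are hypothetical.

```python
from pathlib import Path


def expected_output_file(output_base, psm, hocr=False):
    """Return the file Tesseract is expected to write for this run.

    Hypothetical helper: assumes mode 0 writes <base>.osd, and that other
    modes write <base>.hocr or <base>.txt, as described in this issue.
    """
    if psm == 0:
        suffix = ".osd"   # OSD only: no .txt or .hocr is produced
    elif hocr:
        suffix = ".hocr"
    else:
        suffix = ".txt"
    # Tesseract appends the extension to the output base name.
    return Path(str(output_base) + suffix)


print(expected_output_file("outfile", 0))
print(expected_output_file("outfile", 3))
```

A parser that always reads the .txt or .hocr file corresponds to ignoring the first branch, which is why mode 0 yields no output.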
Issue Links
- relates to: TIKA-2357 Allow Tesseract PSM up to 13 (Resolved)