Tika / TIKA-2696

Support Tesseract OSD output for PSM mode 0


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ocr
    • Labels:
      None

      Description

      TIKA-2357 added support for additional page segmentation modes (PSMs) in Tesseract OCR, including mode 0, which is orientation and script detection (OSD) only. In this mode Tesseract does not perform OCR; it only outputs orientation and script information.

      An example usage of mode 0:

      $ tesseract infile.png outfile --psm 0 -l osd
      

      In this mode, the usual outfile.txt is not created. Instead, as in other modes that run OSD in addition to extraction, the result is an outfile.osd file, like so:

      Page 1
      Warning. Invalid resolution 0 dpi. Using 70 instead.
      Estimating resolution as 212
      Page number: 0
      Orientation in degrees: 0
      Rotate: 0
      Orientation confidence: 13.73
      Script: Latin
      Script confidence: 4.78
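
      The "Key: value" lines in this output are straightforward to read programmatically. The following is a minimal sketch, not Tika code: the class name OsdOutputParser and its parse method are hypothetical, and it assumes that only lines containing ": " carry OSD fields, so diagnostic lines such as "Page 1" or the resolution warning are skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or Tesseract.
class OsdOutputParser {

    /** Parses "Key: value" lines from OSD text; lines without ": " are skipped. */
    static Map<String, String> parse(String osdText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : osdText.split("\\R")) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // diagnostic line such as "Page 1", or a blank line
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "Page 1",
                "Estimating resolution as 212",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        Map<String, String> osd = parse(sample);
        System.out.println(osd.get("Script"));                // Latin
        System.out.println(osd.get("Orientation in degrees")); // 0
    }
}
```

      A map keeps the sketch independent of which fields a given Tesseract version emits; a real integration would more likely expose the values as metadata entries.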
      

      However, TesseractOCRParser#parse(...) only reads the contents of outfile.txt (or outfile.hocr) in all modes, so mode 0 produces no output regardless of input.

      This is consistent with Tika's goal of outputting extracted text, but it defeats the intent of a user who expects OSD output.
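
      One possible direction is to choose the file to read based on the page segmentation mode. The sketch below only illustrates that idea under stated assumptions: OcrOutputFiles, PSM_OSD_ONLY and expectedOutput are hypothetical names, not existing Tika APIs, and the actual parser may organize this logic differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of TesseractOCRParser.
class OcrOutputFiles {

    /** Tesseract page segmentation mode 0: orientation and script detection only. */
    static final int PSM_OSD_ONLY = 0;

    /**
     * Returns the file Tesseract is expected to write for the given output base:
     * outfile.osd in mode 0, otherwise outfile.hocr or outfile.txt.
     */
    static Path expectedOutput(Path outputBase, int psm, boolean hocr) {
        String extension;
        if (psm == PSM_OSD_ONLY) {
            extension = ".osd";
        } else {
            extension = hocr ? ".hocr" : ".txt";
        }
        // Append the extension to the last path element, keeping the parent directory.
        return outputBase.resolveSibling(outputBase.getFileName() + extension);
    }

    public static void main(String[] args) {
        Path base = Paths.get("outfile");
        System.out.println(expectedOutput(base, 0, false)); // outfile.osd
        System.out.println(expectedOutput(base, 1, false)); // outfile.txt
        System.out.println(expectedOutput(base, 1, true));  // outfile.hocr
    }
}
```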

              People

              • Assignee:
                Unassigned
                Reporter:
                4U6U57 August Valera
              • Votes:
                0
                Watchers:
                2
