  1. Tika
  2. TIKA-2696

Support Tesseract OSD output for PSM mode 0

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: ocr
    • Labels: None

    Description

      TIKA-2357 added support for additional page segmentation modes (PSM) for Tesseract OCR, including mode 0, which performs orientation and script detection (OSD) only: it does not run OCR and outputs just the orientation and script information.

      An example usage of mode 0:

      $ tesseract infile.png outfile --psm 0 -l osd
      

      In this mode, the usual outfile.txt is not created. Instead, as in other modes that run OSD in addition to extraction, the result is written to an outfile.osd file, like so:

      Page 1
      Warning. Invalid resolution 0 dpi. Using 70 instead.
      Estimating resolution as 212
      Page number: 0
      Orientation in degrees: 0
      Rotate: 0
      Orientation confidence: 13.73
      Script: Latin
      Script confidence: 4.78
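
      The "Key: value" lines of OSD output like the above could be read into a map along the following lines. This is only a minimal sketch: the class OsdOutputParser and its parse method are hypothetical names, not part of Tika, and it assumes every field of interest is on its own line in the form "Key: value". Lines without that separator (such as "Page 1" or the resolution warning) are skipped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Tika or Tesseract.
public class OsdOutputParser {

    // Collects "Key: value" lines into an ordered map.
    // Lines that do not contain ": " (e.g. "Page 1") are ignored.
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue; // not a key/value line
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        // Print each parsed field on its own line
        parse(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

      Keeping the values as strings avoids guessing at types; a caller could convert the confidence values to numbers where needed.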
      

      However, TesseractOCRParser#parse(...) is coded to read only the contents of outfile.txt (or outfile.hocr) in every mode, so in mode 0 it outputs nothing regardless of input.

      This is consistent with Tika's goal of outputting extracted text, but it goes against the intent of a user who selects mode 0 expecting OSD output.
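
      One way to picture the requested behavior is to choose the file to read based on the PSM value instead of always reading the text or hOCR file. The sketch below is an assumption about how that selection could look, not the actual change made in Tika: TesseractOutputFile and resolve are hypothetical names, and it assumes that in mode 0 only the .osd file is produced, whichever output type is configured.

```java
import java.nio.file.Path;

// Hypothetical helper for illustration; not the actual Tika implementation.
public class TesseractOutputFile {

    // Returns the file Tesseract is expected to have written for the given
    // output base name, e.g. "outfile" -> "outfile.osd" when psm is 0.
    public static Path resolve(Path outputBase, int psm, boolean hocr) {
        String extension;
        if (psm == 0) {
            extension = ".osd";  // OSD only: no .txt or .hocr is assumed to exist
        } else if (hocr) {
            extension = ".hocr";
        } else {
            extension = ".txt";
        }
        return outputBase.resolveSibling(outputBase.getFileName() + extension);
    }

    public static void main(String[] args) {
        System.out.println(resolve(Path.of("outfile"), 0, false));
        System.out.println(resolve(Path.of("outfile"), 3, true));
        System.out.println(resolve(Path.of("outfile"), 3, false));
    }
}
```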

            People

              tallison Tim Allison
              4U6U57 August Valera