Tika / TIKA-2174

TesseractOCRParser declares too few supported formats

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels:
      None

      Description

      A complete install of Leptonica with Tesseract will add support for formats that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.

      Tesseract produces OCR output fine for JPX images as of this version:

        $ tesseract -v
           tesseract 3.04.01
             leptonica-1.73
               libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
      

      However, these types are not declared by getSupportedTypes, so no output is produced for, e.g., PDFs that contain JPX images of scanned documents.
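
The effect described above can be illustrated with a minimal, self-contained sketch. It does not use Tika's actual classes: the class name, the `wouldBeOcred` helper, the contents of the "declared" set, and the MIME strings chosen for JP2, JPX and PPM are all assumptions made for illustration. The only point is that a stream reaches a parser only when the detected type is in the parser's declared set, so extending that set is what makes JPX images eligible for OCR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class SupportedTypesSketch {

    // Hypothetical "before" set: illustrative only, not copied from Tika's source.
    static final Set<String> DECLARED = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("image/png", "image/jpeg", "image/tiff")));

    // The same set extended with the formats named in this issue.
    // The MIME strings below are assumptions for the sketch.
    static final Set<String> EXTENDED;
    static {
        Set<String> types = new LinkedHashSet<>(DECLARED);
        types.add("image/jp2");                // JP2
        types.add("image/jpx");                // JPX
        types.add("image/x-portable-pixmap");  // PPM
        EXTENDED = Collections.unmodifiableSet(types);
    }

    // Models dispatch: a parser is only handed content whose detected
    // type appears in the set of types it declares as supported.
    static boolean wouldBeOcred(Set<String> declaredTypes, String detectedType) {
        return declaredTypes.contains(detectedType);
    }

    public static void main(String[] args) {
        String embedded = "image/jpx"; // e.g. a scanned page embedded in a PDF
        System.out.println("before: " + wouldBeOcred(DECLARED, embedded));
        System.out.println("after:  " + wouldBeOcred(EXTENDED, embedded));
    }
}
```

In this model the JPX image is skipped with the original set and accepted with the extended one, even though the underlying Tesseract/Leptonica install could read it in both cases.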

            People

            • Assignee: Unassigned
            • Reporter: Matthew Caruana Galizia (mcaruanagalizia)
