  1. Tika
  2. TIKA-2174

Too few formats in support declared by TesseractOCRParser

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 1.15, 2.0.0
    • Component/s: parser
    • Labels: None

    Description

      A complete install of Leptonica with Tesseract will add support for formats that are not declared by TesseractOCRParser. These include JP2, JPX and PPM.

      Tesseract produces OCR output fine for JPX images as of this version:

        $ tesseract -v
        tesseract 3.04.01
         leptonica-1.73
          libjpeg 8d : libpng 1.6.26 : libtiff 4.0.6 : zlib 1.2.5
      

      However, these types are not declared by getSupportedTypes, so no output is produced for files that use them, for example PDFs that contain JPX images of scanned documents.
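
      The effect described above can be sketched in a few lines of plain Java. This is a simplified model, not Tika's code: the class name, the helper methods and the "before" list of types are assumptions made for illustration, and the MIME strings for JP2, JPX and PPM are assumed to be image/jp2, image/jpx and image/x-portable-pixmap.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Simplified stand-in for a parser's declared-type check; not Tika's actual code.
class DeclaredTypesSketch {

    // Types assumed to be declared before the fix (illustrative subset only).
    static final Set<String> DECLARED_BEFORE = types(
            "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif");

    // The same set plus the formats named in this report.
    static final Set<String> DECLARED_AFTER = types(
            "image/png", "image/jpeg", "image/tiff", "image/bmp", "image/gif",
            "image/jp2", "image/jpx", "image/x-portable-pixmap");

    // Builds an unmodifiable set from the given media type names.
    static Set<String> types(String... names) {
        Set<String> set = new LinkedHashSet<>();
        Collections.addAll(set, names);
        return Collections.unmodifiableSet(set);
    }

    // In this model, a stream reaches the OCR step only if its type is declared.
    static boolean wouldOcr(Set<String> declared, String mediaType) {
        return declared.contains(mediaType);
    }

    public static void main(String[] args) {
        System.out.println(wouldOcr(DECLARED_BEFORE, "image/jpx")); // false
        System.out.println(wouldOcr(DECLARED_AFTER, "image/jpx"));  // true
    }
}
```

      In this model the JPX image is skipped purely because of the declaration, regardless of whether the installed Tesseract and Leptonica could read it.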

          People

            Assignee: Unassigned
            Reporter: Matthew Caruana Galizia (mcaruanagalizia)