  1. ManifoldCF
  2. CONNECTORS-1287

Additional TikaOCR Configuration Options


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: ManifoldCF 2.3
    • Fix Version/s: ManifoldCF next
    • Component/s: Tika extractor
    • Labels: None

    Description

      For a client project, I needed to enable OCR for images inside PDFs. Unfortunately, ManifoldCF does not provide configuration options for this. It would be nice to have the following options for the Tika content extraction:

      1. Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
      2. Set default language for tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
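      For illustration, the two Tika settings linked above can be wired together roughly as follows. This is only a sketch of plain Tika usage, not of how ManifoldCF's Tika extractor is or would be implemented. It assumes Tika 1.x with the tika-parsers module on the classpath; the class name PdfOcrSketch and the method extractText are hypothetical, and the language value (for example "deu") is whatever Tesseract language pack is installed.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class; requires the Tika jars on the classpath.
final class PdfOcrSketch {

    /** Extracts text from a PDF, passing embedded images on to the OCR parser. */
    static String extractText(String pdfPath, String ocrLanguage) throws Exception {
        // Option 1 from the list: hand inline PDF images to embedded-document parsing.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Option 2 from the list: language passed to Tesseract.
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage(ocrLanguage);

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser itself lets embedded images be parsed recursively.
        context.set(Parser.class, parser);

        // -1 disables the handler's default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(pdfPath))) {
            parser.parse(in, handler, new Metadata(), context);
        }
        return handler.toString();
    }
}
```

      A call such as PdfOcrSketch.extractText("scan.pdf", "deu") would then return the OCR text together with any regular text in the PDF, provided Tesseract is installed. Exposing the boolean and the language string as connector configuration is what this issue asks for.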

      Tika OCR is based on Tesseract, an open-source OCR engine initially developed by Hewlett-Packard and later continued by Google. It is available from https://github.com/tesseract-ocr/tesseract and must be installed with the tesseract binary available in the PATH environment variable; alternatively, the location of the binary can be set using a Tika API method. Once it is installed and Tika is configured correctly, it works like a charm.
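      The PATH requirement can be checked up front. The following is a minimal sketch using only the Java standard library; the class TesseractLocator and its method findOnPath are hypothetical names and are not part of Tika or ManifoldCF. It only mirrors the kind of lookup the operating system performs when a program launches tesseract by name.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or ManifoldCF.
final class TesseractLocator {

    /** Returns the first executable file named executableName found in a PATH-style string. */
    static Optional<Path> findOnPath(String pathValue, String executableName) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, executableName);
                if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip malformed PATH entries instead of failing the whole lookup.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        String name = windows ? "tesseract.exe" : "tesseract";
        Optional<Path> found = findOnPath(System.getenv("PATH"), name);
        System.out.println(found
                .map(p -> "tesseract found at " + p)
                .orElse("tesseract not found on PATH; its location would have to be set in Tika explicitly"));
    }
}
```

      If the lookup comes back empty, the alternative mentioned above applies: the location of the binary has to be handed to Tika explicitly.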

      When indexing images or PDFs containing images instead of real text, OCR is necessary for making those documents searchable.

          People

            Assignee: Karl Wright (kwright@metacarta.com)
            Reporter: Konrad Holl (konrad.holl)
            Votes: 0
            Watchers: 2
