  1. Tika
  2. TIKA-2970

Configuring Tesseract for OCR of PDF via Tika Config is not working

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.22
    • Fix Version/s: 1.23
    • Component/s: ocr
    • Labels: None

      Description

      Based on TIKA-2705, I thought I could eliminate the use of the properties files for configuring PDF and OCR processing, and just use a tika-config.xml file.

      I believe I have a unit test demonstrating that if you need to override the Tesseract path for OCR, you always end up with the default Tesseract configuration, which leads to Tika throwing an error: https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328

      In stepping through the code, it seems like every time we consult the context:

      ```
      TesseractOCRConfig tesseractConfig =
              context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
      ```
      We always get back the default: the context never contains our customized TesseractOCRConfig, even though, when we first load the TikaConfig, I can see that a TesseractOCRParser object is created WITH the various parameters...
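
      To illustrate the behavior described above, here is a minimal, self-contained sketch of a type-keyed lookup with a default fallback. It deliberately does not use Tika's classes: `Context`, `OcrConfig`, `DEFAULT_CONFIG` and `resolvePath` are hypothetical stand-ins, and the empty default path and `/opt/tesseract/bin` are made-up values. It only shows the general pattern: if nothing registers the customized config in the context, the lookup can only ever return the default.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextLookupSketch {

    // Hypothetical stand-in for TesseractOCRConfig; only the path is modeled.
    static final class OcrConfig {
        final String tesseractPath;

        OcrConfig(String tesseractPath) {
            this.tesseractPath = tesseractPath;
        }
    }

    // Hypothetical stand-in for a context keyed by class, with a default fallback.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    // Assumed default: an empty path, meaning "no override configured".
    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("");

    // Same shape as the lookup quoted in the description.
    static String resolvePath(Context context) {
        return context.get(OcrConfig.class, DEFAULT_CONFIG).tesseractPath;
    }

    public static void main(String[] args) {
        // Nothing was registered, so the default comes back.
        Context empty = new Context();
        System.out.println("[" + resolvePath(empty) + "]");

        // Only an explicit registration makes the custom config visible.
        Context populated = new Context();
        populated.set(OcrConfig.class, new OcrConfig("/opt/tesseract/bin"));
        System.out.println("[" + resolvePath(populated) + "]");
    }
}
```

      Under these assumptions, the first lookup yields the empty default path and the second yields the custom one. This matches the symptom reported here: a config object that exists on the parser but is never placed into the context is invisible to this kind of lookup.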

              People

              • Assignee: Tim Allison (tallison)
              • Reporter: David Eric Pugh (epugh)
