Tika / TIKA-2093

Add hOCR output type to the TesseractOCRParser

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: ocr
    • Patch

    Description

      I've tweaked the TesseractOCRParser and TesseractOCRConfig to add "txt" and "hocr" parameters that let you request a specific output format. Tesseract also has a "pdf" output and, in its next version, a "tsv" output, but I didn't add support for those.
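
      As a rough illustration of the idea (not the code from the attached patch), the sketch below shows how an output-type setting could be turned into a Tesseract command line. It assumes the Tesseract CLI form `tesseract <image> <outputbase> <configfile>`, where the trailing config name ("txt" or "hocr") selects the output format. The class name `TesseractCommandSketch`, the `OutputType` enum, and `buildCommand` are hypothetical names made up for this example and are not Tika's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names here are illustrative, not Tika's actual classes.
public class TesseractCommandSketch {

    /** The two output formats named in this issue. */
    enum OutputType {
        TXT("txt"),
        HOCR("hocr");

        final String configName;

        OutputType(String configName) {
            this.configName = configName;
        }
    }

    /**
     * Builds the argument list for a Tesseract invocation.
     * Assumes the CLI form: tesseract <image> <outputbase> <configfile>.
     */
    static List<String> buildCommand(String imagePath, String outputBase, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        // The trailing config name tells Tesseract which output format to write.
        cmd.add(type.configName);
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the command that would be run to request hOCR output.
        System.out.println(String.join(" ", buildCommand("page.png", "page", OutputType.HOCR)));
    }
}
```

      A caller would pass the resulting list to something like `ProcessBuilder` and then read the file Tesseract writes next to the output base. That part is left out here because it depends on a local Tesseract install.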

            People

              Assignee: Tim Allison (tallison)
              Reporter: Eric Pugh (epugh)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: