TIKA-2093: Add hOCR output type to the TesseractOCRParser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: ocr
    • Flags:
      Patch

      Description

      I've tweaked TesseractOCRParser and TesseractOCRConfig to add an output type parameter that accepts "txt" or "hocr", so you can request a specific output format. Tesseract also offers "pdf" output and, in its next version, "tsv" output, but I didn't add support for those.
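
      To illustrate the idea, here is a minimal, self-contained sketch of how an output type setting might map onto a Tesseract command line. It is not the code from the patch: the class `OcrCommandSketch`, the enum `OutputType`, and the methods `fromString`, `configName`, and `buildCommand` are all hypothetical names chosen for this example. It also assumes that the installed Tesseract accepts `txt` and `hocr` as trailing config names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not Tika's actual implementation.
class OcrCommandSketch {

    /** The two output formats the issue says are supported. */
    enum OutputType {
        TXT, HOCR;

        /** Parses a user-supplied value such as "txt" or "hOCR" (case-insensitive). */
        static OutputType fromString(String value) {
            // valueOf throws IllegalArgumentException for anything else, e.g. "pdf" or "tsv".
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }

        /** Lower-case name passed to Tesseract as the trailing config argument (assumed). */
        String configName() {
            return name().toLowerCase(Locale.ROOT);
        }
    }

    /** Builds: tesseract <image> <outputBase> <txt|hocr> */
    static List<String> buildCommand(String imagePath, String outputBase, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        cmd.add(type.configName());
        return cmd;
    }

    public static void main(String[] args) {
        // Only prints the command; it does not invoke Tesseract.
        System.out.println(buildCommand("page.png", "page", OutputType.fromString("hocr")));
    }
}
```

      Restricting the setting to an enum means unsupported formats such as "pdf" or "tsv" are rejected when the configuration is parsed, rather than being passed through to Tesseract.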

              People

              • Assignee: Tim Allison (tallison@apache.org)
              • Reporter: Eric Pugh (epugh)
              • Votes: 0
              • Watchers: 4
