TIKA-2093: Add hOCR output type to the TesseractOCRParser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: ocr
    • Flags:
      Patch

      Description

      I've tweaked the TesseractOCRParser and TesseractOCRConfig to add an output type parameter, "txt" or "hocr", that lets you request a specific output format. Tesseract also has a "pdf" output and, in its next version, a "tsv" output, but I didn't add support for those.
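To make the idea concrete, here is a minimal sketch of how an output type setting could map onto a Tesseract command line. It is not the code from the patch: the class `OcrOutputTypeSketch`, the enum `OutputType`, and the method `buildCommand` are hypothetical names used only for illustration. The one thing it relies on from Tesseract itself is that plain text is the default output and that passing the bundled `hocr` config name after the output base switches the output to hOCR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch; not Tika's TesseractOCRConfig or TesseractOCRParser. */
final class OcrOutputTypeSketch {

    /** The two output types named in this issue. */
    enum OutputType {
        TXT, HOCR;

        /** Parses "txt" or "hocr" (case-insensitive) and rejects anything else. */
        static OutputType fromString(String value) {
            if (value == null) {
                throw new IllegalArgumentException("output type must not be null");
            }
            switch (value.trim().toLowerCase(Locale.ROOT)) {
                case "txt":
                    return TXT;
                case "hocr":
                    return HOCR;
                default:
                    throw new IllegalArgumentException("unsupported output type: " + value);
            }
        }
    }

    /** Builds the argument list: tesseract <image> <outputbase> [hocr]. */
    static List<String> buildCommand(String imagePath, String outputBase, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add(outputBase);
        if (type == OutputType.HOCR) {
            // Naming the "hocr" config file asks Tesseract for hOCR instead of plain text.
            cmd.add("hocr");
        }
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("page.png", "out", OutputType.fromString("hocr")));
    }
}
```

Rejecting unknown values up front means a typo in configuration fails before any external process is started, rather than silently falling back to plain text.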

          Activity

          tallison@mitre.org Tim Allison added a comment -

          W00t! Thank you, again, for the PR!

          epugh Eric Pugh added a comment - edited

          BTW, just got to updating my project with the latest 1.14-SNAPSHOT, and the hOCR process is working great. Thanks for getting this patch in.

          Not sure who marks things "Resolved", but from my perspective, it's Resolved.

          epugh Eric Pugh added a comment -

          Thanks for this, and for the addition of the HOCRPassthroughHandler. I'll give it a test today, but I suspect this is exactly what I need.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #148 (See https://builds.apache.org/job/tika-2.x/148/)
          TIKA-2093 - Add Tesseract's hOCR output format as an option, via Eric (tallison: rev 673533d0e65b2b2613e19bbf952bdb352c628e52)

          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #52 (See https://builds.apache.org/job/tika-2.x-windows/52/)
          TIKA-2093 - Add Tesseract's hOCR output format as an option, via Eric (tallison: rev 673533d0e65b2b2613e19bbf952bdb352c628e52)

          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1106 (See https://builds.apache.org/job/Tika-trunk/1106/)
          TIKA-2093 – add option for Tesseract's hOCR output, thanks to Eric (tallison: rev 3a5431e200056d85b458bea766fd185225771c97)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          tallison@mitre.org Tim Allison added a comment - edited

          Eric Pugh, I made a few modifications. The biggest was parsing hocr and passing on the relevant elements to Tika's xhtml handler. The rest was in great shape.

          Once I hear back that this will work for your use case with my mods, I'll resolve this ticket.

          Thank you for the PR!

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/133

          tallison@mitre.org Tim Allison added a comment -

          On mobile, can't do full review. If hocr output is xhtml, we'll prob want to parse it and transfer elements to the tika handler; otherwise we'll have a blob of encoded xhtml inside our xhtml. May misunderstand tho...
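A rough sketch of the "parse it and transfer elements" idea, using only the JDK's SAX classes. This is not Tika's code and not the HOCRPassthroughHandler: `HocrForwardingSketch`, `BodyOnlyForwarder`, and `RecordingHandler` are hypothetical names, and the recorder stands in for whatever handler would really receive the events. The sample input is a hand-written hOCR-like fragment with no DOCTYPE, so the parser never tries to fetch a DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of forwarding hOCR elements instead of embedding escaped markup. */
final class HocrForwardingSketch {

    /** Stand-in for the receiving handler: records events as text (class attribute only). */
    static final class RecordingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName);
            String cls = atts.getValue("class");
            if (cls != null) {
                out.append(" class=\"").append(cls).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Forwards only what sits inside <body>, dropping the html/head/body wrapper. */
    static final class BodyOnlyForwarder extends DefaultHandler {
        private final DefaultHandler target;
        private boolean inBody;

        BodyOnlyForwarder(DefaultHandler target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if ("body".equals(qName)) {
                inBody = true;
            } else if (inBody) {
                target.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if ("body".equals(qName)) {
                inBody = false;
            } else if (inBody) {
                target.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (inBody) {
                target.characters(ch, start, length);
            }
        }
    }

    /** Parses the hOCR string and returns what the receiving handler saw. */
    static String forward(String hocr) {
        RecordingHandler recorder = new RecordingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(hocr)), new BodyOnlyForwarder(recorder));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse hOCR", e);
        }
        return recorder.out.toString();
    }

    public static void main(String[] args) {
        String hocr = "<html><head><title>t</title></head><body><div class=\"ocr_page\">"
                + "<span class=\"ocrx_word\" title=\"bbox 10 10 60 30\">Hello</span></div></body></html>";
        System.out.println(forward(hocr));
    }
}
```

Because the receiving handler gets real element events rather than one string of markup, it has nothing to escape, which avoids the "blob of encoded xhtml inside our xhtml" outcome.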

          githubbot ASF GitHub Bot added a comment -

          GitHub user epugh opened a pull request:

          https://github.com/apache/tika/pull/133

          add hOCR output format to TesseractParser TIKA-2093

          Small change to Tesseract OCR code to add the hOCR outputType. In the future we can add `pdf` and `tsv` as output types as well.

          First patch to Tika, please provide feedback!

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/epugh/tika feature/hocr_osr_support

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/133.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #133


          commit 10507d0521a0f06c50f32aa6150228ef4ac773d4
          Author: Eric Pugh <epugh@o19s.com>
          Date: 2016-09-22T17:14:55Z

          add hOCR output format to TesseractParser



            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              epugh Eric Pugh