[TIKA-1703] Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 1.11
    • Component/s: parser
    • Labels:
      None

      Description

      If a user specifies the path to the Tesseract executable using TesseractOCRConfig.setTesseractPath, Tika assumes that the Tesseract config folder (usually referred to as the 'tessdata' folder) is in the same location. This usually holds in a Windows environment, where everything is installed into a single location.

      However, this is not necessarily the case in a Linux environment. If Tesseract is built from source, for example, the config folder is installed in a different location from the Tesseract executable.

      One way to fix this would be to add an option that specifies the location of the Tesseract config folder separately from the path to the executable.
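
      As a rough illustration of the idea (a sketch only, not the actual patch), the data folder could be resolved independently and fall back to the executable's folder when it is not set. The class and method names below (TessdataEnv, resolveDataDir, apply) are hypothetical. TESSDATA_PREFIX is the environment variable Tesseract reads to locate its data; the exact directory level it expects differs between Tesseract versions, so check your installation.

```java
// Hypothetical helper for illustration; this is NOT Tika's actual code.
final class TessdataEnv {

    private TessdataEnv() {}

    // Prefer an explicitly configured data folder; otherwise fall back to
    // the folder of the executable, which mirrors the behaviour described above.
    static String resolveDataDir(String tesseractPath, String tessdataPath) {
        if (tessdataPath != null && !tessdataPath.trim().isEmpty()) {
            return tessdataPath;
        }
        return tesseractPath == null ? "" : tesseractPath;
    }

    // Export the resolved folder to the child process through TESSDATA_PREFIX.
    // Nothing is set when neither path is configured, so Tesseract's own
    // defaults apply.
    static void apply(ProcessBuilder pb, String tesseractPath, String tessdataPath) {
        String dataDir = resolveDataDir(tesseractPath, tessdataPath);
        if (!dataDir.isEmpty()) {
            pb.environment().put("TESSDATA_PREFIX", dataDir);
        }
    }
}
```

      With something like this, a source build that puts the executable under /usr/local/bin and the data under /usr/local/share could pass both paths, while a setup that only sets the executable path would keep behaving as before.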

          Activity

          ASF GitHub Bot (githubbot) added a comment:

          GitHub user taidan19 opened a pull request:

          https://github.com/apache/tika/pull/56

          TIKA-1703 Add ability to specify Tesseract config path.

          Link to Jira ticket - https://issues.apache.org/jira/browse/TIKA-1703

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/taidan19/tika TIKA-1703

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/56.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #56


          commit 86e8fdf187af5051812e1164c4cc3fef737a0644
          Author: Christian Wolfe <taidan19@gmail.com>
          Date: 2015-08-04T00:54:23Z

          TIKA-1703 Add ability to specify Tesseract config path.


          ASF GitHub Bot (githubbot) added a comment:

          GitHub user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/56

          Chris A. Mattmann (chrismattmann) added a comment:

          Thanks, Christian Wolfe! Applied in r1694133.

          Hudson (hudson) added a comment:

          SUCCESS: Integrated in tika-trunk-jdk1.7 #812 (See https://builds.apache.org/job/tika-trunk-jdk1.7/812/)
          Fix for TIKA-1703: Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path. Contributed by Christian Wolfe <taidan19@gmail.com>; this closes #56. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1694133)

          • /tika/trunk/CHANGES.txt
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-properties/TesseractOCRConfig-full.properties

            People

            • Assignee:
              chrismattmann Chris A. Mattmann
              Reporter:
              taidan19 Christian Wolfe
            • Votes:
              0
              Watchers:
              4
