Details

      Description

      There is a regex in TesseractOCRConfig.setLanguage(String language) that attempts to validate the language being set. Unfortunately, it rejects some language codes that are valid for Tesseract.

      For example:

      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setLanguage("chi_tra");

      This throws an IllegalArgumentException because of the '_' in the language name, even though "chi_tra" is a valid Tesseract language code.

      The regex needs to be updated to allow the '_' character.
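
      As an illustration of the requested change, here is a minimal sketch of a validator that accepts underscores. It is not the actual Tika code: the class name, the method name validateLanguage, and the pattern itself are assumptions made for this example. The pattern also accepts '+' between codes, since Tesseract lets several languages be combined that way (e.g. "eng+chi_tra").

```java
import java.util.regex.Pattern;

public class LanguageValidationSketch {

    // Hypothetical pattern (not taken from Tika): one or more codes made of
    // ASCII letters and underscores, optionally joined by '+'.
    private static final Pattern LANGUAGE =
            Pattern.compile("[A-Za-z_]+(\\+[A-Za-z_]+)*");

    // Returns the language unchanged if it matches the pattern,
    // otherwise throws IllegalArgumentException.
    public static String validateLanguage(String language) {
        if (language == null || !LANGUAGE.matcher(language).matches()) {
            throw new IllegalArgumentException("Invalid language code: " + language);
        }
        return language;
    }

    public static void main(String[] args) {
        // Underscored and combined codes pass validation.
        System.out.println(validateLanguage("chi_tra"));
        System.out.println(validateLanguage("eng+chi_tra"));

        // Characters outside the allowed set are still rejected.
        try {
            validateLanguage("eng; rm -rf");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

      Keeping an explicit allow-list of characters, rather than dropping validation altogether, still blocks unexpected input in a value that ends up on the tesseract command line.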

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1182 (See https://builds.apache.org/job/Tika-trunk/1182/)
          TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage() (graham: rev 5c51534a5731dba0ed22bc04b7da9d95adfb6f50)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
          TIKA-2231 – update changes.txt. This closes #147 (tallison: rev c978a1195b0b910f523468d51d73e54caba535c0)
          • (edit) CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x #201 (See https://builds.apache.org/job/tika-2.x/201/)
          TIKA-2231 - allow underscored language codes (e.g. "chi_tra") in (tallison: rev 9dbff6065cf202a5795effaaf7e953ae6f761fe3)

          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java
          • (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
          • (edit) CHANGES.txt
          tallison@mitre.org Tim Allison added a comment -

          Thank you for opening this issue. Thank you, ham1 (Graham Russell) for the PR!

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/147

          githubbot ASF GitHub Bot added a comment -

          GitHub user ham1 opened a pull request:

          https://github.com/apache/tika/pull/147

          TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage()

          I also improved and added more test cases.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ham1/tika TIKA-2231

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/147.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #147


          commit 5c51534a5731dba0ed22bc04b7da9d95adfb6f50
          Author: Graham Russell <graham@ham1.co.uk>
          Date: 2017-01-17T21:48:49Z

          TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage() and added more tests



            People

            • Assignee: Unassigned
            • Reporter: pmweiss5 Peter Weiss
            • Votes: 0
            • Watchers: 4

                Time Tracking

                Estimated: 1h
                Remaining: 1h
                Logged: Not Specified
