
Tess4jOCRParser - A simpler Java version of TesseractOCRParser

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.15
    • Component/s: ocr
    • Labels:
      None

      Description

      Right now, TesseractOCRParser calls Tesseract and ImageMagick from the command line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing Tesseract out of process via Runtime.exec.
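
For readers unfamiliar with the out-of-process pattern described above, here is a minimal sketch of what such a call typically involves. It is not the code in TesseractOCRParser; the class and method names are hypothetical, and it assumes a `tesseract` executable on the PATH that accepts the usual `tesseract <image> <outputbase> -l <lang>` form and writes `<outputbase>.txt`. An in-process binding such as Tess4J replaces the process launch and the temporary output file with a direct library call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration only; not taken from Tika's TesseractOCRParser.
final class OutOfProcessOcrSketch {

    // Builds the command line: tesseract <image> <outputbase> -l <lang>.
    // Tesseract appends ".txt" to the output base by itself.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs tesseract as a child process and reads the text file it produces.
    // Assumes the tesseract executable is installed and on the PATH.
    static String ocr(Path image, String language) throws IOException, InterruptedException {
        Path tmpDir = Files.createTempDirectory("ocr-sketch");
        Path outputBase = tmpDir.resolve("out");
        Path textFile = tmpDir.resolve("out.txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("tesseract exited with code " + exit);
            }
            return Files.readString(textFile);
        } finally {
            // The temporary file and directory are the extra I/O an in-process call avoids.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(tmpDir);
        }
    }
}
```

Each call pays for a process start plus a round trip through the file system, which is the overhead the proposed parser aims to remove.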

          Activity

          ThejanWijesinghe Thejan Wijesinghe added a comment -
          1. I have created Tess4JOCRParser, and it works smoothly with multiple image types, including png, jpg, jpeg, tiff, bmp, gif, jp2, jpx and ppm.
          2. I wrote a benchmark test to compare this parser with TesseractOCRParser; the results are below.
          3. TesseractOCRParser took 449 seconds to OCR 100 images, while Tess4JOCRParser took only 417 seconds. The result varies from run to run, but most of the time Tess4JOCRParser OCRs an image about 300 ms faster than TesseractOCRParser. The source files are in my repo at the following links.

          https://github.com/ThejanW/tika/blob/TIKA-2293/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tess4JOCRParser.java
          https://github.com/ThejanW/tika/blob/TIKA-2293/tika-parsers/src/test/java/org/apache/tika/parser/ocr/Tess4JOCRParserTest.java
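
As a quick check of the figures quoted above (449 s versus 417 s for 100 images), the per-image arithmetic can be worked through as follows. This is only a sketch of the calculation; the class name is hypothetical, and the input numbers are the ones reported in the comment, not new measurements.

```java
// Hypothetical helper that reworks the reported benchmark totals per image.
final class BenchmarkArithmetic {

    // Average time per image in milliseconds for a given total in seconds.
    static double perImageMillis(double totalSeconds, int images) {
        return totalSeconds * 1000.0 / images;
    }

    public static void main(String[] args) {
        int images = 100;
        double cliMillis = perImageMillis(449, images);     // TesseractOCRParser
        double tess4jMillis = perImageMillis(417, images);  // Tess4JOCRParser
        double savedMillis = cliMillis - tess4jMillis;
        double savedPercent = 100.0 * savedMillis / cliMillis;
        System.out.printf("per image: %.0f ms vs %.0f ms, saved %.0f ms (%.1f%%)%n",
                cliMillis, tess4jMillis, savedMillis, savedPercent);
    }
}
```

The totals work out to 4490 ms and 4170 ms per image, a difference of 320 ms or roughly 7%, which is consistent with the "about 300 ms" figure in the comment.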

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          Other than that, I have also added an image preprocessing function to Tess4JOCRParser. At the moment it only supports OCRing rotated images. It does not use a Python script such as Rotation.py to calculate the rotation angle, or ImageMagick to correct the image angle. The approach I have implemented is pretty straightforward: there is no redundant I/O and no temporary resources are created, so I presume it is faster.
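
To illustrate what correcting a rotated image entirely in memory can look like, here is a minimal sketch using only the JDK's `java.awt` classes. It is not the implementation in the pull request: the class name is hypothetical, the rotation angle is assumed to be already known (how the parser estimates it is not described in this thread), and the white background fill is an assumption chosen because OCR input is usually dark text on a light page.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;

// Hypothetical illustration of in-memory rotation; no temp files, no external tools.
final class InMemoryRotation {

    // Returns a new image containing src rotated by the given angle around its center.
    // The canvas is enlarged so that the rotated content is not clipped.
    static BufferedImage rotate(BufferedImage src, double degrees) {
        double rad = Math.toRadians(degrees);
        double sin = Math.abs(Math.sin(rad));
        double cos = Math.abs(Math.cos(rad));
        int w = src.getWidth();
        int h = src.getHeight();
        int newW = (int) Math.round(w * cos + h * sin);
        int newH = (int) Math.round(h * cos + w * sin);

        BufferedImage dst = new BufferedImage(newW, newH, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setColor(Color.WHITE); // assumed page background
            g.fillRect(0, 0, newW, newH);
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Move to the new center, rotate, then shift the source so its center lands there.
            AffineTransform t = new AffineTransform();
            t.translate(newW / 2.0, newH / 2.0);
            t.rotate(rad);
            t.translate(-w / 2.0, -h / 2.0);
            g.drawImage(src, t, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Because the result is a `BufferedImage`, it can be handed straight to an in-process OCR call without being written to disk first.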

          githubbot ASF GitHub Bot added a comment -

          ThejanW opened a new pull request #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158

          Right now, TesseractOCRParser calls Tesseract and ImageMagick from the command line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing Tesseract out of process via Runtime.exec. Please feel free to visit TIKA-2293 for more information.


          githubbot ASF GitHub Bot added a comment -

          GitHub user ThejanW opened a pull request:

          https://github.com/apache/tika/pull/158

          TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser

          Right now, TesseractOCRParser calls Tesseract and ImageMagick from the command line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing Tesseract out of process via Runtime.exec. Please feel free to visit TIKA-2293 for more information.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ThejanW/tika master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/158.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #158


          commit 6d6128f02099f4453f1876328c933ede17f7b559
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T05:27:38Z

          Tess4JOCRParser class implemented successfully. I can extract content through Handler now.

          commit 5a44b86807a594318d06d47e8bb890c3cfd7654b
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T05:31:44Z

          Tess4JOCRParser class implemented successfully. I can extract content through Handler now.

          commit def106014347330a8500cf3f615eb49bcd23ca22
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T05:58:07Z

          TODO: Test time evaluations

          commit 825447f39b39fa83180611091067ed1a6373b9d7
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T09:31:48Z

          Wrote the test case to compare the two parsers.

          commit ecbe7a8773ffed723bf9a2a420a64b62ac0860e9
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T11:17:13Z

          Added test images, Reformatted the ocr parser test case

          commit 31c4fb0f0cda5e6df102d09762f2e93aae0e5c4d
          Author: Thamme Gowda <thammegowda@apache.org>
          Date: 2017-03-11T14:42:15Z

          Merge branch 'master' of https://github.com/ThejanW/tika into thejan-tess4j

          commit 4c87364003a7f0dec86932b4e1b28291432e5fcb
          Author: Thamme Gowda <thammegowda@apache.org>
          Date: 2017-03-11T15:54:55Z

          performance improvements + code clean

          commit 9e672e9da7ff35400c24b21644924b99563999c2
          Author: Thejan Wijesinghe <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T17:43:38Z

          Merge pull request #1 from thammegowda/thejan-tess4j

          Performance improvements and Fixes

          commit f5f07429e96f32c3e718ccfec8a3163916b29448
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-12T05:51:40Z

          Excluded tess4J from bringing log4j-over-slf4j.jar + some code reformatting

          commit 25bd1c2eb47db7ccc3a30d12fe199c77d2303e8a
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-12T09:00:41Z

          Deleted the use of extractHOCROutput method + Enabled Tesseract's quiet command line option + Code reformatting

          commit e41250af6ca27158d209896577e6e305abcbcb52
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-12T10:19:08Z

          Performance improvements

          commit 94a2a70add233f53779011325a7ff0c94e4e91d7
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-15T11:57:23Z

          Set org.apache.tika.parser.Parser to default.

          commit 75e185a10884d6afe08555050f676f8ea95d66be
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-15T13:03:57Z

          Merge branch 'master' of https://github.com/apache/tika

          # Please enter a commit message to explain why this merge is necessary,
          # especially if it merges an updated upstream into a topic branch.
          #
          # Lines starting with '#' will be ignored, and an empty message aborts
          # the commit.

          Syncing with the upstream.

          commit 260e9cec23f0bde2de975ac7142132b7ffa1cf17
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-17T17:29:49Z

          TIKA - 2293

          1. Relocate test images
          2. Add deskewing functionality for skewed images
          3. Add new unit tests

          commit 69504c54ffda2c93fb8205e88dd82b3a119455f4
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-17T17:33:24Z

          TIKA - 2293

          1. Add test images

          commit 4058e49f29d081d45bbe84b3ac75267e2a8d7cf0
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-17T17:46:20Z

          TIKA - 2293

          1. Fix minor error in test document path in runBenchmark unit test

          commit 1a188aa2dcd8eb30086fd297cbeb31cfe47f0863
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-18T07:39:15Z

          TIKA - 2293

          1. change the tesseract model to volatile
          2. add informative comments
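
The commit above mentions making the Tesseract model `volatile`. The pull request's code is not reproduced in this thread, so the following is only a sketch of the pattern that such a change usually belongs to: a lazily created, shared instance published safely across threads with double-checked locking. `Engine` is a hypothetical stand-in for an expensive-to-create OCR engine object.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of lazy, thread-safe initialization with a volatile field.
final class LazyEngineHolder {

    // Stand-in for an expensive object such as an OCR engine; counts how often it is built.
    static final class Engine {
        static final AtomicInteger CREATED = new AtomicInteger();

        Engine() {
            CREATED.incrementAndGet();
        }
    }

    // volatile guarantees that a thread seeing a non-null reference
    // also sees the fully constructed object behind it.
    private static volatile Engine engine;

    static Engine get() {
        Engine local = engine;              // single volatile read on the fast path
        if (local == null) {
            synchronized (LazyEngineHolder.class) {
                local = engine;             // re-check under the lock
                if (local == null) {
                    local = new Engine();
                    engine = local;
                }
            }
        }
        return local;
    }
}
```

Without `volatile`, the double-checked idiom is not safe under the Java memory model, which is one plausible reason for the change described in the commit message.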

          commit dd3f3a299b2d7bb742a4fc12133ef500c68a2439
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-18T07:49:42Z

          Merge remote-tracking branch 'upstream/master'

          Sync with upstream

          commit ac6677dab44cef7e1de20181201f4f14103c3d71
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-18T08:51:03Z

          TIKA - 2293

          1. remove unnecessary test cases and test images from master

          tallison@mitre.org Tim Allison added a comment -

          Thejan Wijesinghe, thank you for sharing this and running some comparisons with our current Tesseract parser.

          I really like:
          1. The notion that users don't have to figure out how to install Tesseract on their system. "Simple" plug and play.
          2. The theoretical simplicity of not having to create the temp files and make a system call to python and tesseract etc.
          3. The notion of being able to use some of the lower-level features of Tesseract that aren't available from the commandline...but I only have a vague notion of these...what features from the underlying Tesseract do we need that aren't available from the commandline?

          I'm concerned about:
          1a. The LGPL license on ghost4j means that we can't bundle that with our jars. Do I understand the license of ghost4j? If so, and if we don't include ghost4j, what will happen? Is that only used for PDFs...so we'd be on our own for those, right?
          1b. There's another LGPL license on leptonica4j's rococoa dependency. What happens if we can't bundle that?
          2. The general notion of packaging native libs. I undid that choice with our sqlite parser and required that users add that jar to their classpath.
          3. We'd be adding 38 MB to the tika-app and tika-server jars. That's just for the Windows dlls, right? Do I understand correctly that Linux users would be on their own to install libtesseract.so?
          4. tess4j comes with the English language pack. Users who wanted other languages would still have to grab and install the other language packs in the tessdata directory, which cuts into the appeal of "runs tesseract out of the box".
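
On the language-pack concern in point 4: Tesseract language packs are files named `<lang>.traineddata` in the tessdata directory, so a parser could at least report which requested languages are missing before attempting OCR. The sketch below shows one way to do that with the JDK alone; the class and method names are hypothetical and nothing like this is claimed to exist in the pull request.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper that inspects a tessdata directory for installed language packs.
final class TessdataLanguages {

    private static final String SUFFIX = ".traineddata";

    // Returns the language codes for which a <lang>.traineddata file exists in the directory.
    static Set<String> installed(Path tessdataDir) throws IOException {
        try (Stream<Path> files = Files.list(tessdataDir)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(SUFFIX))
                    .map(name -> name.substring(0, name.length() - SUFFIX.length()))
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    // Returns the requested language codes that have no matching pack, in request order.
    static List<String> missing(Path tessdataDir, List<String> requested) throws IOException {
        Set<String> have = installed(tessdataDir);
        return requested.stream()
                .filter(lang -> !have.contains(lang))
                .collect(Collectors.toList());
    }
}
```

A check like this does not remove the need to install extra packs, but it turns a confusing OCR failure into a clear message about which files to add.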

          gagravarr Nick Burch added a comment -

          We can't include LGPL libraries in our releases; see http://www.apache.org/legal/resolved.html#category-x. So, if we do need those jars, they'd need to live externally and be something that users manually downloaded if they were happy with the license. It could be listed at https://wiki.apache.org/tika/3rd%20party%20parser%20plugins as we do for other GPL / LGPL needing parsers.

          I'm not sure bumping the size of the Tika parsers / app / server by 70-odd MB (we'd really need to support Linux + Windows if we're going to make it easy to use) is necessarily a great thing to do by default, especially if it doesn't have enough common languages in.

          tallison@mitre.org Tim Allison added a comment -

          Thank you, Nick. I completely agree.


            People

            • Assignee:
              Unassigned
              Reporter:
              ThejanWijesinghe Thejan Wijesinghe
            • Votes:
              0
              Watchers:
              4
