TIKA-2293

Tess4jOCRParser - A simpler Java version of TesseractOCRParser

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: ocr
    • Labels:
      None

      Description

      Right now, TesseractOCRParser calls tesseract and ImageMagick from the command line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing tesseract out of process via Runtime.exec.
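For readers unfamiliar with the out-of-process approach described above, here is a minimal sketch of what such a call looks like. It is not Tika's actual code: the class and method names are hypothetical, and it only assumes the standard `tesseract <image> <outputbase> -l <lang>` command-line form. An in-process parser would replace all of this with a library call into Tess4J.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: shows the shape of an out-of-process OCR call.
class OutOfProcessOcrSketch {

    // Builds the argument list for: tesseract <image> <outputBase> -l <language>
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image.toString());
        cmd.add(outputBase.toString());
        cmd.add("-l");
        cmd.add(language);
        return cmd;
    }

    // Starts the external process and waits for its exit code.
    // Requires a tesseract executable on the PATH.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; it does not launch tesseract.
        List<String> cmd = buildCommand(Paths.get("page.png"), Paths.get("out"), "eng");
        System.out.println(String.join(" ", cmd)); // prints "tesseract page.png out -l eng"
    }
}
```

The temp files, process start-up and exit-code handling are the overhead that an in-process binding is meant to avoid.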

          Activity

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          Thank you, Tim Allison. I'll post on the dev list once I finalize the documentation for this.

          tallison@mitre.org Tim Allison added a comment -

          Thank you Thejan Wijesinghe for all of your work on this! Please keep us posted when/if you choose to host this parser.

          githubbot ASF GitHub Bot added a comment -

          chrismattmann commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158#issuecomment-299224128

          closing per comments from @tballison

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          chrismattmann closed pull request #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158

          githubbot ASF GitHub Bot added a comment -

          rferreira commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158#issuecomment-299212236

          ah makes sense, thanks!

          githubbot ASF GitHub Bot added a comment -

          tballison commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158#issuecomment-299211242

          See the discussion here: https://issues.apache.org/jira/browse/TIKA-2293 . I think there's consensus that this doesn't buy us enough and actually adds some complexity to our current setup. I proposed moving this into a standalone project/parser that we can mention.

          githubbot ASF GitHub Bot added a comment -

          rferreira commented on issue #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158#issuecomment-299209814

          hey folks, what's the good word on the PR? Seems like a reasonable improvement.

          lfcnassif Luis Filipe Nassif added a comment -

          Just to contribute to the discussion: in my experience, Windows users do not always have the correct version of the Microsoft Redistributable Package installed, and the 2015 Redist is very large, several GBs. Also, the fact that Tess4j only includes native libs for Windows is a limitation; users will still have to install (or compile) the tesseract software. The SQLite3Parser includes native libs for Windows, Linux and Mac, and even with those it is an optional dependency.

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          Thank you, Tim Allison and Thamme Gowda, for your responses.
          Tim,
          I understand your concern about trained data files taking up so much space. Doubling the size of tika-app and tika-server for a single component of TIKA is not at all a practical thing to do. I am happy that you are willing to promote this parser.

          Thamme,
          Yes, I will make this an independent parser that can be plugged into TIKA, and then I will write a wiki page linking to my repo.

          tallison@mitre.org Tim Allison added a comment -

          I think we take care of this somewhat in Tika 2.0 with modular parser packages... at least for large model files. Native libs and licenses are still going to be a challenge.

          thammegowda Thamme Gowda added a comment -

          Tim Allison
          Totally agree with all the points on model files taking up many times the size of the Tika core/parser source code.

          I think we have just started bringing machine learning capabilities to Tika. I foresee more and bigger model files (new deep learning OCR models, image/video/audio recognition, captioning... the list goes on). IMHO, these model files are equally important.
          In the long run, we end up having either too many REST services (thus making the system too broken) or native dependencies (making it tied to platforms) or the model files (thus making it too fat). We will hit the same discussion again, so I am wondering if we can also consider alternative, future-proof solutions for dealing with large model files. Perhaps making these models optional extensions, and not including them in the core distribution?

          thammegowda Thamme Gowda added a comment -

          Thanks, Nick Burch and Tim Allison for timely feedback.
          Agree with all the feedback.

          Thejan, great work. Your efforts are appreciated, as they helped us evaluate and understand the pros and cons of JNI-based OCR libs.
          I feel Tess4j could have been more modular, so that its native libs and models could be selectively included or excluded, but it is not!

          As suggested, please make this an independent parser in your GitHub repo.
          For an example, you may refer to https://github.com/thammegowda/tika-ner-corenlp.
          We had a similar situation: a GPL license and huge model files for NER.
          We made it an extension to Tika and documented it on the wiki: https://wiki.apache.org/tika/TikaAndNER#Using_Stanford_CoreNLP_NER

          tallison@mitre.org Tim Allison added a comment -

          Thejan, Thank you for your work on this parser and for digging into the license issues.

          I regret that I'm against adding tess4j into Tika even with the rococoa license issue taken away.

          We currently have a wall that users must climb over – they have to have Tesseract installed. That requires a certain amount of technical know-how, and, from a support perspective, while we can try to help, we can say that it isn't our responsibility to install Tesseract for them.

          With tess4j, the burden would be on us to hope that the embedded dlls work for Windows users, and it would kind of be on the user to get the right .so for themselves if they're on Linux. The clear wall we currently have disappears, and we now have to help people in this uncertain area where when things go wrong, it is kind of our fault and kind of not our fault.

          We have the same wall now for language packs, and I agree, we would NOT want to ship all the language packs. However, now, the user is responsible for getting his/her own language packs, and we're not in this in-between state as we would be with tess4j where we're giving people the English pack and then we have to support them installing the other language packs.

          Also, with Tess4j, we'd be nearly doubling the size of tika-app/tika-server with just the Windows dlls, and that doesn't even include the Linux .so(s?).

          In my opinion, I'd far prefer our current setup with the overhead of the commandline and slightly slower OCR'ing to the above headaches we'd have with tess4j.

          I would very strongly support adding this as a standalone parser on your own github site or on another third-party site, and I'd be more than happy to promote it and point people to it. I'd also want to run an evaluation on quality/speed between this third party Tesseract integration and ours so that we can understand the differences.

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          About automating the download process of trained data for languages:

          Why?
          1. .traineddata files are huge. It's impractical to include even a few language packages in a bundle.

          2. We don't want to burden everyone with language packages that they don't need.

          How?
          1. How about we use the command line for this?
          a) If a TIKA user calls the setLanguage("Some language") method, we can first search the tessdata folder (by parsing a command-line argument) for the necessary traineddata file. If we find it, we can proceed with the OCR process.
          b) If it is not found in the tessdata folder, we can simply parse another command-line argument to download the necessary traineddata files and move them into the tessdata folder. Then we can proceed with the OCR process.

          Benefits:
          1. This way, we can ensure that only the users who need other language packages download them.

          2. Since we are automating that procedure, users don't have to worry about downloading trained data and moving it into the tessdata folder.

          Please give me your feedback on my idea. If you see a better solution than mine, please let me know that as well.
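Steps a) and b) above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class, the `PackFetcher` interface and the method names are hypothetical, the actual download is left to a caller-supplied hook (no download URL is assumed), and the language spec is assumed to use tesseract's `eng+fra` form for combined languages.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "look in tessdata first, fetch only what is missing".
class TessdataResolverSketch {

    // Hypothetical hook: how a missing pack is obtained is up to the caller.
    interface PackFetcher {
        void fetch(String language, Path target) throws IOException;
    }

    // Step a): returns the languages in a spec such as "eng+fra"
    // whose <lang>.traineddata file is not present in tessdataDir.
    static List<String> missingLanguages(Path tessdataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String lang : languageSpec.split("\\+")) {
            if (lang.isEmpty()) {
                continue;
            }
            if (!Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"))) {
                missing.add(lang);
            }
        }
        return missing;
    }

    // Step b): fetches each missing pack into tessdataDir and
    // returns the list of languages that had to be fetched.
    static List<String> ensureLanguages(Path tessdataDir, String languageSpec, PackFetcher fetcher)
            throws IOException {
        Files.createDirectories(tessdataDir);
        List<String> toFetch = missingLanguages(tessdataDir, languageSpec);
        for (String lang : toFetch) {
            fetcher.fetch(lang, tessdataDir.resolve(lang + ".traineddata"));
        }
        return toFetch;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tessdata");
        // Stand-in fetcher: creates an empty placeholder file instead of downloading anything.
        List<String> fetched = ensureLanguages(dir, "eng+fra", (lang, target) -> Files.createFile(target));
        System.out.println("Fetched: " + fetched);
    }
}
```

Keeping the fetch step behind an interface means users who already manage their own language packs never trigger a download, which matches the first benefit listed above.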

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          I have some unexpected good news. I tried Tess4J on a Mac and, even after excluding the Rococoa dependency, I could run the tests. So I contacted the founder of the Tess4J library and asked him about this, and this is what he said, in his exact words:
          "You've just uncovered a bug in the software. All these years it has been using an incorrect type – the correct one should be that of Leptonica, not Rococoa. The JNAerator tool we used to generate the Java binding for Leptonica had erroneously identified and used a wrong data type.

          The unneeded dependency, Rococoa, has been removed. It may take a few days to publish the change to Maven.

          That probably has removed the license issue that you experience."

          So, I think we are good to go.

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          Thank you, Tim and Nick, for your responses. It is a pleasure to have you guys there to help me.

          Tess4J's low-level API supports obtaining more information on the scanned words, such as scanning accuracy and whether the word is underlined, bold or italic, etc.

          1a. Ghost4J is used only to convert PDF files to TIFF or PNG. We can safely exclude it and yes, then we would be on our own when it comes to converting PDFs to OCRable formats.

          1b. Rococoa-core is a generic Java wrapper for Cocoa, Mac OS X's native API. I could exclude Rococoa-core and still run the tests without a problem on my Linux machine, but I'm not sure about its effect on Mac OS X. (How is TIKA's support for Mac OS? Sorry for asking; I have never tried TIKA on a Mac.)

          3. Yes, for Windows the necessary DLLs come bundled with Tess4J. These DLLs are built with VS2015 and therefore depend on the Visual C++ 2015 Redistributable Packages, so Windows users need to have those installed (which I presume most Windows users have). According to [1], Linux users need to install libtesseract.so. I didn't have to, because I had used "sudo apt-get install tesseract-ocr", but to my amazement, even after purging tesseract-ocr, I could still run the Tess4JOCRParser tests successfully. Perhaps purging didn't delete libtesseract.so from the system.

          4. Trained data for English, aka eng.traineddata, comes bundled with the tess4j jar in a folder named tessdata. If the user needs to OCR an image in another language, or a combination of languages other than English, he or she will have to download the specific trained data from [2] and put it in the tessdata folder.

          4b. I am not sure whether we can give users the luxury of having these language packages downloaded automatically if the user sets the language to something other than English. Can we create a mechanism to download those language packages automatically (similar to Maven downloading dependencies)? Is that practical?

          [1] http://tess4j.sourceforge.net/usage.html
          [2] https://github.com/tesseract-ocr/tessdata

          tallison@mitre.org Tim Allison added a comment -

          Thank you, Nick. I completely agree.

          gagravarr Nick Burch added a comment -

          We can't include LGPL libraries in our releases, see http://www.apache.org/legal/resolved.html#category-x . So, if we do need those jars, they'd need to live externally and be something that users manually download if they are happy with the license. It could be listed at https://wiki.apache.org/tika/3rd%20party%20parser%20plugins as we do for other parsers needing GPL / LGPL libraries.

          I'm not sure bumping the size of the Tika parsers / app / server by 70-odd MB (we'd really need to support Linux + Windows if we're going to make it easy to use) is necessarily a great thing to do by default, especially if it doesn't have enough common languages in.

          tallison@mitre.org Tim Allison added a comment -

          Thejan Wijesinghe, thank you for sharing this and running some comparisons with our current Tesseract parser.

          I really like:
          1. The notion that users don't have to figure out how to install Tesseract on their system. "Simple" plug and play.
          2. The theoretical simplicity of not having to create the temp files and make a system call to python and tesseract etc.
          3. The notion of being able to use some of the lower-level features of Tesseract that aren't available from the commandline...but I only have a vague notion of these...what features from the underlying Tesseract do we need that aren't available from the commandline?

          I'm concerned about:
          1a. The LGPL license on ghost4j means that we can't bundle that with our jars. Do I understand the license of ghost4j? If so, and if we don't include ghost4j, what will happen? Is that only used for PDFs...so we'd be on our own for those, right?
          1b. There's another LGPL license on leptonica4j's rococoa dependency. What happens if we can't bundle that?
          2. The general notion of packaging native libs. I undid that choice with our sqlite parser and required that users add that jar to their classpath.
          3. We'd be adding 38 MB to the tika-app and tika-server jars. That's just for the Windows dlls, right? Do I understand correctly that Linux users would be on their own to install libtesseract.so?
          4. tess4j comes with the English language pack. Users who wanted other languages would still have to grab and install the other language packs in the tessdata directory, which cuts into the appeal of "runs tesseract out of the box".
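Concern 2 above describes the alternative to bundling: the user adds the jar to the classpath, and the parser is only active when it is there. A minimal sketch of that kind of check follows. The helper class is hypothetical and is not how Tika itself discovers parsers; the class name `net.sourceforge.tess4j.Tesseract` is used here on the assumption that it is Tess4J's main entry point.

```java
// Hypothetical helper: detects whether an optional library was added to the classpath.
class OptionalDependencySketch {

    // Returns true if the named class can be found, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OptionalDependencySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name for Tess4J; the result depends on the local classpath.
        boolean available = isOnClasspath("net.sourceforge.tess4j.Tesseract");
        System.out.println("Tess4J available: " + available);
    }
}
```

With a check like this, the core distribution carries no native libraries or LGPL jars, and the responsibility for supplying them stays clearly with the user.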

          githubbot ASF GitHub Bot added a comment -

          GitHub user ThejanW opened a pull request:

          https://github.com/apache/tika/pull/158

          TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser

          Right now, TesseractOCRParser calls tesseract and imagemagick from the command line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing tesseract out of process via Runtime.exec. Please feel free to visit TIKA-2293 for more information.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ThejanW/tika master

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/158.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #158


          commit 6d6128f02099f4453f1876328c933ede17f7b559
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T05:27:38Z

          Tess4JOCRParser class implemented successfully. I can extract content through Handler now.

          commit 5a44b86807a594318d06d47e8bb890c3cfd7654b
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T05:31:44Z

          Tess4JOCRParser class implemented successfully. I can extract content through Handler now.

          commit def106014347330a8500cf3f615eb49bcd23ca22
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T05:58:07Z

          TODO: Test time evaluations

          commit 825447f39b39fa83180611091067ed1a6373b9d7
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T09:31:48Z

          Wrote the test case to compare the two parsers.

          commit ecbe7a8773ffed723bf9a2a420a64b62ac0860e9
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T11:17:13Z

          Added test images, Reformatted the ocr parser test case

          commit 31c4fb0f0cda5e6df102d09762f2e93aae0e5c4d
          Author: Thamme Gowda <thammegowda@apache.org>
          Date: 2017-03-11T14:42:15Z

          Merge branch 'master' of https://github.com/ThejanW/tika into thejan-tess4j

          commit 4c87364003a7f0dec86932b4e1b28291432e5fcb
          Author: Thamme Gowda <thammegowda@apache.org>
          Date: 2017-03-11T15:54:55Z

          performance improvements + code clean

          commit 9e672e9da7ff35400c24b21644924b99563999c2
          Author: Thejan Wijesinghe <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-11T17:43:38Z

          Merge pull request #1 from thammegowda/thejan-tess4j

          Performance improvements and Fixes

          commit f5f07429e96f32c3e718ccfec8a3163916b29448
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-12T05:51:40Z

          Excluded tess4J from bringing log4j-over-slf4j.jar + some code reformatting

          commit 25bd1c2eb47db7ccc3a30d12fe199c77d2303e8a
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-12T09:00:41Z

          Deleted the use of extractHOCROutput method + Enabled Tesseract's quiet command line option + Code reformatting

          commit e41250af6ca27158d209896577e6e305abcbcb52
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-12T10:19:08Z

          Performance improvements

          commit 94a2a70add233f53779011325a7ff0c94e4e91d7
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-15T11:57:23Z

          Set org.apache.tika.parser.Parser to default.

          commit 75e185a10884d6afe08555050f676f8ea95d66be
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-15T13:03:57Z

          Merge branch 'master' of https://github.com/apache/tika

          # Please enter a commit message to explain why this merge is necessary,
          # especially if it merges an updated upstream into a topic branch.
          #
          # Lines starting with '#' will be ignored, and an empty message aborts
          # the commit.

          Syncing with the upstream.

          commit 260e9cec23f0bde2de975ac7142132b7ffa1cf17
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-17T17:29:49Z

          TIKA - 2293

          1. Relocate test images
          2. Add deskewing functionality for skewed images
          3. Add new unit tests

          commit 69504c54ffda2c93fb8205e88dd82b3a119455f4
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-17T17:33:24Z

          TIKA - 2293

          1. Add test images

          commit 4058e49f29d081d45bbe84b3ac75267e2a8d7cf0
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-17T17:46:20Z

          TIKA - 2293

          1. Fix minor error in test document path in runBenchmark unit test

          commit 1a188aa2dcd8eb30086fd297cbeb31cfe47f0863
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-18T07:39:15Z

          TIKA - 2293

          1. change the tesseract model to volatile
          2. add informative comments

          commit dd3f3a299b2d7bb742a4fc12133ef500c68a2439
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-18T07:49:42Z

          Merge remote-tracking branch 'upstream/master'

          Sync with upstream

          commit ac6677dab44cef7e1de20181201f4f14103c3d71
          Author: ThejanW <thejanwijesinghe.14@cse.mrt.ac.lk>
          Date: 2017-03-18T08:51:03Z

          TIKA - 2293

          1. remove unnecessary test cases and test images from master

          githubbot ASF GitHub Bot added a comment -

          ThejanW opened a new pull request #158: TIKA-2293 - Tess4jOCRParser - A simpler Java version of TesseractOCRParser
          URL: https://github.com/apache/tika/pull/158

          Right now, TesseractOCRParser calls tesseract and imagemagick from the command line. The intention of this new parser, "Tess4jOCRParser", is to use the Tess4J API instead of executing tesseract out of process via Runtime.exec. Please feel free to visit TIKA-2293 for more information.

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          ThejanWijesinghe Thejan Wijesinghe added a comment -

          Other than that, I have also added an image preprocessing function to the Tess4JOCRParser. It only supports OCRing rotated images at the moment, but it does not use a Python script like Rotation.py to calculate the rotation angle, or ImageMagick to correct the image angle. The approach I have implemented here is pretty straightforward: no redundant I/O and no temporary resources. So I presume it is faster.
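
          A minimal sketch of what in-memory rotation correction can look like with only the standard java.awt classes. This is not the code from the pull request: the class name InMemoryRotation, the method rotate, and the white background fill are illustrative assumptions, and the rotation angle is assumed to be known already (how the parser estimates it is not shown here).

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tika or Tess4J.
class InMemoryRotation {

    /**
     * Returns a copy of src rotated by angleDegrees around its center.
     * Everything happens on BufferedImage objects, so no temporary files
     * or external processes are involved.
     */
    static BufferedImage rotate(BufferedImage src, double angleDegrees) {
        double radians = Math.toRadians(angleDegrees);
        double sin = Math.abs(Math.sin(radians));
        double cos = Math.abs(Math.cos(radians));
        int w = src.getWidth();
        int h = src.getHeight();

        // Size of the bounding box that contains the rotated image.
        int newW = (int) Math.round(w * cos + h * sin);
        int newH = (int) Math.round(h * cos + w * sin);

        // Move the source center to the origin, rotate, then move it
        // to the center of the destination canvas.
        AffineTransform transform = new AffineTransform();
        transform.translate(newW / 2.0, newH / 2.0);
        transform.rotate(radians);
        transform.translate(-w / 2.0, -h / 2.0);

        BufferedImage dst = new BufferedImage(newW, newH, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Assumed background: white, so uncovered corners do not show up as black.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, newW, newH);
            g.drawImage(src, transform, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

          To undo a detected skew, the image would be rotated by the negative of the estimated angle, for example InMemoryRotation.rotate(image, -skewAngle), and the result passed straight to the OCR call.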

          ThejanWijesinghe Thejan Wijesinghe added a comment -
          1. I have created Tess4JOCRParser, and it works smoothly with multiple image types, including png, jpg, jpeg, tiff, bmp, gif, jp2, jpx and ppm.
          2. I wrote a benchmark test to compare this parser with the TesseractOCRParser; the results are below.
          3. TesseractOCRParser took 449 seconds to OCR 100 images, while Tess4JOCRParser took only 417 seconds. The result varies from run to run, but most of the time Tess4JOCRParser OCRs an image about 300 ms faster than TesseractOCRParser. The source files in my repo are at the following links.

          https://github.com/ThejanW/tika/blob/TIKA-2293/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tess4JOCRParser.java
          https://github.com/ThejanW/tika/blob/TIKA-2293/tika-parsers/src/test/java/org/apache/tika/parser/ocr/Tess4JOCRParserTest.java
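
          The per-image figure follows directly from the two totals quoted above. A small sketch of the arithmetic, where the class name BenchmarkMath and its method names are illustrative and not taken from the linked test:

```java
// Hypothetical helper that only reworks the numbers reported in this comment.
class BenchmarkMath {

    /** Average time per image in milliseconds for a run of imageCount images. */
    static double perImageMillis(double totalSeconds, int imageCount) {
        return totalSeconds * 1000.0 / imageCount;
    }

    /** Fraction of the baseline time saved by the candidate (0.0 to 1.0). */
    static double relativeSaving(double baselineSeconds, double candidateSeconds) {
        return (baselineSeconds - candidateSeconds) / baselineSeconds;
    }
}
```

          With the reported totals, perImageMillis(449, 100) is 4490 ms and perImageMillis(417, 100) is 4170 ms, a difference of 320 ms per image, and relativeSaving(449, 417) comes to roughly 7%. That matches the "about 300 ms" figure for this particular run.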


            People

            • Assignee:
              Unassigned
              Reporter:
              ThejanWijesinghe Thejan Wijesinghe
            • Votes:
              0
              Watchers:
              6


                Development