Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.7
    • Component/s: parser
    • Labels:
      None

      Description

      I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.

      1. TesseractOCR_Tyler_v2.patch
        18 kB
        Tyler Palsulich
      2. TesseractOCR_Tyler.patch
        17 kB
        Tyler Palsulich
      3. TesseractOCRParser.patch
        25 kB
        Luis Filipe Nassif
      4. TesseractOCRParser.patch
        26 kB
        Luis Filipe Nassif
      5. testOCR.docx
        61 kB
        Grant Ingersoll
      6. testOCR.pdf
        41 kB
        Grant Ingersoll
      7. testOCR.pptx
        77 kB
        Grant Ingersoll
      8. TIKA-93.patch
        40 kB
        Grant Ingersoll
      9. TIKA-93.patch
        38 kB
        Grant Ingersoll
      10. TIKA-93.patch
        28 kB
        Grant Ingersoll
      11. TIKA-93.patch
        21 kB
        Grant Ingersoll

        Activity

        Jukka Zitting added a comment -

        OCRopus (http://code.google.com/p/ocropus/) seems like a nice tool for this. It's a command line tool so we'd need to use something like the ExternalParser class to use it, but the annotated HTML output it generates is already very close to what Tika uses, so the integration should be easy.

        Joachim Zittmayr added a comment - edited

        Are there any updates regarding this issue?

        Jukka Zitting added a comment -

        > are there any updates regarding this issue?

        Not really. I've done some simple tests with ExternalParser invoking Tesseract and OCRopus, but neither is really suited for simple OOTB integration.

        I also tried the commercial Asprise OCR SDK (http://asprise.com/product/ocr/index.php?lang=java) which was much easier to set up and get reasonable results from, but obviously it's something that we can't use in an Apache project.

        If someone wants to help with this, the first step would be to come up with reasonably simple steps to get a liberally licensed OCR engine like OCRopus installed and configured so that you can invoke it using a simple command line like "ocr image.gif" and get the extracted text on the standard output. It should work for at least a few simple test cases. Note that this work should be contributed back to the upstream project.

        Once we have something like that, we can move forward with integrating it to Tika.
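
        For reference, a minimal sketch of what such a command-line integration could look like from Java, assuming a hypothetical "ocr" binary on the PATH that prints the extracted text to standard output (none of these names come from an existing tool):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;

        public class CommandLineOcr {
            // Runs the assumed "ocr <image>" command and returns whatever it prints to stdout.
            public static String extractText(String imagePath) throws Exception {
                Process process = new ProcessBuilder("ocr", imagePath)
                        .redirectErrorStream(true) // merge stderr into stdout for simplicity
                        .start();
                StringBuilder text = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        text.append(line).append('\n');
                    }
                }
                int exitCode = process.waitFor();
                if (exitCode != 0) {
                    throw new RuntimeException("OCR command failed with exit code " + exitCode);
                }
                return text.toString();
            }
        }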

        Mike added a comment -

        It's been a while since this bug has been visited. I have an upstream issue on my project that requires OCR, and it would be great if we could get this moving again. I don't have the resources to develop my own OCR system, so it would be amazing if somebody got it into Tika.

        Nick Burch added a comment -

        There's been a bit more work on the External Parser support (see TIKA-634), which may make the calling more flexible.

        The main missing step at the moment is getting a single command line program we can run to do the OCR, as Jukka said likely using Tesseract or OCRopus.

        Enrico Stahn added a comment -

        You could use docsplit, which is a wrapper around Tesseract and other tools but probably has a simpler API.

        Pei Chen added a comment -

        Have you seen JavaOCR (pure Java OCR, BSD licensed)? http://sourceforge.net/projects/javaocr/
        I have not tried it out myself yet (looks like 1.0 was just released about a week ago).
        I think a pure Java implementation may be easier than forking another process (exec'ing the C++ binary) or introducing JNI dependencies.
        If interested, I could give it a whirl the next chance I get...

        Jukka Zitting added a comment -

        JavaOCR looks interesting, and it looks like it's also available on the central Maven repository.

        Maciej Lizewski added a comment -

        Anything new on this topic? Has anyone tried the JavaOCR library with success? Does anybody have a working Tika+OCR configuration?

        Pei Chen added a comment -

        I tried their javaocr-20100605 release with just ASCII scanned digits and it seemed to work as advertised. It was fairly easy to use/set up. However, I noticed that their latest release has a lot of work geared towards Android development. I haven't had a chance to try integrating it with Tika yet.
        Are there any preferences on how it should flow in the context of Tika?

        frank added a comment -

        This feature is really useful and helpful.

        Grant Ingersoll added a comment -

        I'm noodling around with producing a patch for this and have a few questions for the group:

        1. Where in Tika do people usually put these kinds of "downstream" tasks? Presumably we would need to work with the mime type detection process to know that the input is something that is binary and potentially OCR-able. I would imagine we would want something that inserts between Detection and Parsing. I'd also suggest we make it pluggable, so that we can support other OCR solutions.
        2. Is anyone aware of anything in PDFBox that allows you to know if a document is an Image based PDF?
        Grant Ingersoll added a comment -

        > Is anyone aware of anything in PDFBox that allows you to know if a document is an Image based PDF?

        I figured this one out using the ExtractImages class included in PDFBox.

        Chris A. Mattmann added a comment -

        Good work on #2 Grant. As for #1, you could take several paths like you mentioned:

        1. Build an OCR parser, and then intercept (using MIME detection/MAGIC and/or precedence; or by writing a custom Detector) the detection step to map your OCR parser to PDF files.

        2. Declare your Parser's support via its static SUPPORTED_TYPES for PDF related OCR.

        Happy to iterate with you on this.

        Grant Ingersoll added a comment -

        I thought about the Parser approach, but it doesn't really feel like a Parser. That is, many different things may be images or have embedded images (PDFs, actual images like JPG, etc., embedded images in Word/PPT docs), so I want to take the MIME type and feed it, optionally, to the OCR engine which extracts the images and produces one or more items of text, which will give me back something I can then pass along to the Parser.

        So, for instance, in the case of a PPT with embedded images, you would:

        1. Detect PPT
        2. Extract/OCR Images
        3. Feed to PPT/POI Parser
        4. Obtain glory

        In a generic sense, what is somewhat needed is a pipeline approach. That being said, I've already got one of those, I just want the library abstraction that Tika gives me to plug and play my OCR tool and get text out of it.

        An alternative would be that Parsers for MIME Types that allow for the content to be an image can optionally take in an OCR Engine and as they do their parsing, they look for images.

        BTW, for JavaOCR, the main issue seems to be getting training data for the image parsing. Tesseract, on the other hand, has a rich set of models out of the box, but is written in C++ (although it has Java wrappers).

        Chris A. Mattmann added a comment -

        Thanks Grant, obtaining glory is win.
        Still sounds like a Parser to me though, but I'll be interested to see if you whip out some patches and what they would look like. The nice thing about Parsers is that they spit out XHTML and you can then transform it with ContentHandlers, which is where the real pipeline in Tika capabilities are. So moving into Parser ville gets you a pipeline effect downstream at least.

        Grant Ingersoll added a comment -

        Chris, are Parsers composable? If it is a Parser, how do I make it work w/ all the different MIME types that have images? (It's been a while since I've contributed to Tika, so please bear with me.) Wouldn't we have one-off code that essentially hacks OCR into the various different parsers? I'm thinking there must be some way to normalize/simplify it. I'll take a poke through the Parsers at a deeper level. Maybe a Parser takes in an OCR Engine, which is an implementation of something like Tesseract or JavaOCR.

        Chris A. Mattmann added a comment -

        Grant no problem at all and happy to bear with ya. It's been a while since I delved deep into the code myself
        Parsers are composable, there is a CompositeParser here:
        http://tika.apache.org/1.4/api/org/apache/tika/parser/CompositeParser.html

        So yeah, you could have an OCRBaseParser that extends CompositeParser and calls super with the List<Parser> of parsers to call (along with a specific MIME registry, etc.). And yep, one could be Tesseract or JavaOCR, etc.
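
        For illustration, a rough sketch of the shape described above, assuming the default media type registry; the engine-specific parsers passed in (e.g. a Tesseract- or JavaOCR-backed parser) are hypothetical:

        import java.util.Arrays;

        import org.apache.tika.mime.MediaTypeRegistry;
        import org.apache.tika.parser.CompositeParser;
        import org.apache.tika.parser.Parser;

        // Delegates to whichever engine-specific OCR parser claims the incoming media type.
        public class OCRBaseParser extends CompositeParser {
            public OCRBaseParser(Parser... engineParsers) {
                super(MediaTypeRegistry.getDefaultRegistry(), Arrays.asList(engineParsers));
            }
        }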

        Grant Ingersoll added a comment -

        Food for thought:

        We introduce OCRParser that extends Parser (and we'd likely have a base class too)
        In the Context, we set the instance, just like we do w/ the Parser.class:

        context.set(Parser.class, parser);

        i.e.

        context.set(OCRParser.class, ocrParser);

        Then, we can, over time, add to the various parsers the ability, when detecting Image info, to apply the OCRParser in the context of the current parser. So, for instance, the PDFParser, when detecting an Image could optionally extract text from the images. The other benefit, here, of course, is that the OCRParser implementation will work independently on anything that is an Image.
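
        For illustration, a rough sketch of how a format parser might consume that context entry; the OCRParser interface is the hypothetical one proposed above, and the surrounding variables come from the enclosing parse method:

        // Inside a format parser's parse() method, when an embedded image is found (sketch only):
        OCRParser ocrParser = context.get(OCRParser.class);
        if (ocrParser != null) {
            // OCR is enabled for this parse: let the configured engine emit the
            // recognized text into the same ContentHandler.
            ocrParser.parse(imageStream, handler, metadata, context);
        }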

        Grant Ingersoll added a comment -

        Well, Tesseract is out, at least as far as using Tess4j goes, as it has LGPL and BCL dependencies. Ugh, especially since Tesseract itself is ASL. And here Tesseract looks so promising, at least in my initial tests (compared to JavaOCR, which requires a bunch of training work up front)

        frank added a comment -

        BTW, does this feature support the .TIFF format? We have a lot of files scanned into the computer in TIFF format.

        Grant Ingersoll added a comment -

        It can, via some ancient JavaIO stuff, which, in some cases, has some weird dependencies. Still working this out, but the way this is shaping up is that it is all going to have to be very pluggable to avoid any of these cases. If anyone is up for lobbying the Tess4J team to remove GPL/LGPL/viral dependencies, we'd be in much better shape.

        Grant Ingersoll added a comment -

        Here is a very early stage patch that creates a JavaOCR parser. It is not integrated into any of the other parsers, yet.

        I also added Jacoco code coverage to the Parent POM so that we can now generate coverage reports. For example:

        1. mvn verify (from the top level)

        Or, after running mvn test

        1. mvn jacoco:check

        Once done, check the target/site/jacoco directory to see the reports.

        Not sure on Tika workflow for JIRA, but if someone wants to Assign this Issue to me, I'll take it the next few steps.

        Grant Ingersoll added a comment -

        Tests for the JavaOCRParser. Next step is to start integrating into various other parsers.

        Chris A. Mattmann added a comment -

        Hey Grant, patch is looking good! I will need to download it and test it out, but this is just based on a cursory inspection.
        Some comments:

        1. what is the dependency on jacoco in tika-parent? That stuff seems orthogonal to the patch.
        2. maybe think about providing the training directory as part of the ParseContext (maybe a property like o.a.tika.parser.ocr.trainingDataDirPath?)
        3. dependency on custom external Maven repo – myGrid – any way to get the jar from the Central repo somewhere? We have made an effort in Tika to remove any specific deps on external repositories, see: http://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/#.UvaEN0JdWxU

        Looking great. Maybe we can get some of this in 1.6 even with the deps on the external repo but we need to get rid of those before releasing. I will try this out in a few hours! I'm excited b/c I may even be able to use this for the homework assignments in my CS572 class on Search Engines where we look at FBI Vault PDF files! http://www-scf.usc.edu/~csci572/

        Grant Ingersoll added a comment -

        This shows what I am thinking for integration with PDFParser. Not sure if it fits with what others have in mind when it comes to how the OCRParser gets integrated.

        Grant Ingersoll added a comment -

        > what is the dependency on jacoco in tika-parent? That stuff seems orthogonal to the patch.

        I put that in so that I can measure whether I am testing sufficiently. I can separate it out to a different patch.

        > dependency on custom external Maven repo – myGrid – any way to get the jar from the Central repo somewhere? we have made an effort in Tika to remove any specific deps on external repositories

        We could make that one optional. All it does is add support for TIFF and a few other file formats that aren't part of the standard ImageIO.

        > in my CS572 class on Search Engines where we look at FBI Vault PDF files! http://www-scf.usc.edu/~csci572/

        I read your abstract for your talk and checked out the Vault and thought it would be cool, too. The main issue is that JavaOCR needs to be trained in order to work with that data set. Tesseract, on the other hand, works for it, but alas, needs to be implemented as an OCRParser. Since Tess4J has some bad deps, the only way I could see to do this is to exec the process or go write my own JNI integration for Tesseract. The latter isn't likely to happen. The former feels less than desirable, but would work.


        Grant Ingersoll added a comment -

        FYI: http://roncemer.com/software-development/java-ocr/

        Grant Ingersoll added a comment -

        Not sure I am happy w/ the changes here yet, esp. the changes to the PDFParserConfig. Probably need a more generic way of knowing whether we turn on/off OCR parsing. I suppose that should go on the ParseContext, but it wasn't obvious to me how one should set boolean flags there. Will poke around more.

        Nick Burch added a comment -

        Generally speaking, when a parser finds embedded resources, it calls out to the Parser on the context to have it processed. You could therefore set your OCR Parser there, and it'd be called for all kinds of embedded resources. It can then OCR any suitable images it finds, and pass on everything else to another parser (eg DefaultParser) to have the non-OCR-able embedded parts handled (if required)

        To handle OCRing of top level content, eg images, you'd need to register your OCR parser as the parser for those types, in place of (or possibly even wrapping) the default parser.

        Grant Ingersoll added a comment -

        Not sure if this is progress or not...

        The testOCR.* files need to go in the parsers/src/test/resources/test-documents directory.

        Things that changed:

        1. Moved config to ParseContext instead of one off implementation in PDFParserConfig.
        2. Used the existing ParseContext for passing in the OCRParser instead of separate handling
        3. Added some more test files. Will upload them.

        Things I could use help on:

        1. Trying to get this integrated into the Office stuff. I see the DELEGATING_PARSER capabilities for embedded extraction, but not quite sure about how to best leverage that. See JavaOCRParserTest.testOCR for some attempts at setting up the test
        2. Overall, my biggest lack of understanding is around how to configure this stuff. As I see it, we need to be able to set 2 things:
          1. The OCRParser or DelegatingParser. I'm not sure how embedded contexts are used in practice. Note that some of the OCRParser implementations will require configuration/training before they can be used.
          2. Whether or not to actually use the OCRParser (a boolean flag), as OCR is expensive and not everyone will want it for every doc, etc.
        Timo Boehme added a comment -

        I would like to give some comments on detecting/handling of image-based PDFs, because the proposed solution will only work with a subset of these kinds of documents. First, one could classify image-based PDFs into 3 classes:

        1. image only (one image per page)
        2. image with text overlay/underlay already produced by an OCR process
        3. multiple images per page (instead of one full page image there are images per word/line/paragraph)

        Thus, from only testing for a page-size image, one does not know if we nevertheless have parseable text or if we have a class 3 document (in the case of e.g. journals we might even have a full-page background image). For an automatic classification, one would need to first try to parse text in the standard way for a few pages. One should not expect image-only PDFs to contain no text - in some cases header/footer/page numbers are added as text whereas other content is only an image. A heuristic threshold is 60-80 characters per page, below which we can assume we have an image PDF.
        If a PDF is assumed to be an image PDF, the pages should be 'printed' into an image (in order to also handle class 3 documents and to keep mixed data (image + text)) and this image should be processed by OCR.

        Best,
        Timo

        Luis Filipe Nassif added a comment -

        I tried Tess4J a few months ago. The main problem was JVM crashes with some kinds of images, probably related to native code execution. So I changed to the Tesseract exec approach, which obviously did not cause JVM crashes. But the OCRing of a few images never terminated, and some timeout control was necessary, as for other existing Tika parsers.

        Grant Ingersoll added a comment -

        > changed to the Tesseract exec approach

        Can you add it as a patch? I was going to add it this week, but if you already have it, then it would save me some time.

        Timo Boehme

        All good insight. I am not an OCR/Doc expert, so if you could update my patch w/ either comments on where this stuff should go or actual help on it, that would be awesome. I would really love to see OCR support get in soon.

        Luis Filipe Nassif added a comment -

        Grant,

        Unfortunately I am currently too busy to do a patch, and my code is a bit application-specific. For example, it saves the extracted text in an output folder to be displayed later to the user without the need to rerun OCR. But I can upload it if you think that will help.

        frank added a comment -

        Will this great feature be released in version 1.5?

        Grant Ingersoll added a comment -

        Frank, no, 1.5 is due out soon (already?) and this isn't close to being ready yet.

        Luis Filipe Nassif: patches welcome!

        Chris A. Mattmann added a comment -

        It's an awesome feature and we'll do our best to include it in 1.6.

        Luis Filipe Nassif added a comment - edited

        Another approach would be to include images and PDF in the supportedTypes of the OCRParser and call their respective parsers within the OCRParser, instead of modifying the code of existing parsers.

        About enabling and configuring the OCRParser, it could be included in tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser and could be passed an OCRConfig object via the ParseContext. If not passed, OCRParser could simply call the existing image or PDF parser.

        I agree with Timo that it would be better to print a PDF to images rather than iterate over its objects - not for all PDFs, but for those with few chars. A CharCountContentHandler could be used with PDFParser to test this; a rough sketch follows below.

        Finally, Tesseract already includes support for TIFF files.
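
        A minimal sketch of what such a character-counting handler might look like; CharCountContentHandler is not an existing Tika class, and the 60-80 characters-per-page threshold is the heuristic Timo suggested above:

        import org.xml.sax.helpers.DefaultHandler;

        // Counts the characters emitted during a normal text parse so the caller can
        // decide whether the document is likely image-only and worth OCRing.
        public class CharCountContentHandler extends DefaultHandler {
            private long count = 0;

            @Override
            public void characters(char[] ch, int start, int length) {
                count += length;
            }

            public long getCharacterCount() {
                return count;
            }
        }

        The caller would run the PDF through PDFParser with this handler first and only fall back to rendering pages and OCRing them when the count divided by the page count is below the threshold.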

        Luis Filipe Nassif added a comment -

        Patch with first version of a tesseract-ocr based OCRParser, with simple timeout control.

        Luis Filipe Nassif added a comment -

        We could populate metadata using the existing image parsers. If OCR is not enabled, this parser could just populate image metadata. So we could include this parser in the services/org.apache.tika.parser.Parser list without changing default image parsing.

        Luis Filipe Nassif added a comment -

        Better timeout control using FutureTask
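
        For readers following along, a minimal sketch of that FutureTask-based timeout around an external OCR process; the binary name, arguments, and timeout value are placeholders rather than the exact patch contents:

        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;
        import java.util.concurrent.FutureTask;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.TimeoutException;

        public class OcrWithTimeout {
            // Runs "tesseract <input> <outputBase>" and gives up after the timeout,
            // destroying the process so a stuck image cannot hang the whole parse.
            public static int run(String inputImage, String outputBase, long timeoutSeconds)
                    throws Exception {
                final Process process =
                        new ProcessBuilder("tesseract", inputImage, outputBase).start();
                FutureTask<Integer> waitTask = new FutureTask<Integer>(new Callable<Integer>() {
                    public Integer call() throws Exception {
                        return process.waitFor();
                    }
                });
                ExecutorService executor = Executors.newSingleThreadExecutor();
                executor.execute(waitTask);
                try {
                    return waitTask.get(timeoutSeconds, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    process.destroy();
                    throw new RuntimeException("OCR timed out after " + timeoutSeconds + " seconds", e);
                } finally {
                    executor.shutdownNow();
                }
            }
        }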

        Luis Filipe Nassif added a comment -

        Did anyone take a look at TesseractOCRParser.patch? I will be happy to improve it.

        Chris A. Mattmann added a comment -

        It's looking good Luis! This seems to be a good case though for using Tika's External parser package:

        http://tika.apache.org/1.5/api/org/apache/tika/parser/external/package-summary.html

        I noticed that we are creating processes inside the patch, and maybe it would be good to simply make it leverage ExternalParser?
        I'm happy to work through an update to the patch to do that. Give me a day or so.

        Luis Filipe Nassif added a comment -

        Hi Chris,

        I noticed ExternalParser before. It would be very good to leverage it to start the OCR process, but Tesseract appends ".txt" to the output filename and needs the environment variable TESSDATA_PREFIX to be set up. Maybe ExternalParser needs to be updated to support an output filename suffix? I also tried to extend Grant's work, implementing the OCRParser interface, so TesseractParser could also be used for images embedded in PDFs. Could that be done by leveraging ExternalParser?

        Chris A. Mattmann added a comment -

        Hi Luis, yep those are some good ideas and we may need to extend ExternalParser here to provide those capabilities you're talking about. I'll see about extending it b/c I think it's the better way to go. Give me a day or so and I'll hopefully have something.

        Anurag Indu added a comment -

        Hello all, I tried to use Tesseract to extract all the images from a PDF and convert them to their text values. I am using a Windows 8 laptop with an i5 and 8GB RAM, and it takes 15 minutes to process a single PDF. Could someone point me to the issue with the code (added below)? Where can I improve the performance? I am not using threading here.
        List<?> pages = document.getDocumentCatalog().getAllPages();
        Iterator<?> iter = pages.iterator();
        StringBuilder text = new StringBuilder();
        while (iter.hasNext()) {
            PDPage page = (PDPage) iter.next();
            PDResources resources = page.getResources();
            Map<String, PDXObjectImage> pageImages = resources.getImages();
            if (pageImages != null) {
                Iterator<String> imageIter = pageImages.keySet().iterator();
                while (imageIter.hasNext()) {
                    String key = imageIter.next();
                    PDXObjectImage image = pageImages.get(key);
                    image.write2file(key);
                    Runtime rt = Runtime.getRuntime();
                    String command = "\"" + tessPath + "\" \"" + key + ".tiff\" out";
                    Process pr = rt.exec(command);
                    int result = -1;
                    try {
                        result = pr.waitFor();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    if (result == 0) {
                        String x = readFile("out.txt", Charset.defaultCharset());
                        text.append(x);
                    }
                    new File(key + ".tiff").delete();
                    new File("out.txt").delete();
                }
            }
        }

        Timo Boehme added a comment -

        Hi Anurag, which PDF are you referring to? Without knowing the size, page count and structure of the pages, it is hard to say what is going wrong. For instance, it could be - as I already wrote in my last comment - that the pages contain a large number of images (e.g. one per word or chunk) instead of a single one per page. Try printing the PDF to images (one per page) and running those through Tesseract; a rough sketch of that approach follows below.
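
        For illustration, a rough sketch of the page-rendering approach described above, using the same PDFBox 1.x API as the snippet in the previous comment (the output directory and image format are placeholders):

        import java.awt.image.BufferedImage;
        import java.io.File;
        import java.util.List;

        import javax.imageio.ImageIO;

        import org.apache.pdfbox.pdmodel.PDDocument;
        import org.apache.pdfbox.pdmodel.PDPage;

        public class PageRenderForOcr {
            // Render each page to a single image and OCR that one image per page,
            // instead of running Tesseract on every embedded image object separately.
            public static void renderPages(PDDocument document, File outDir) throws Exception {
                List<?> pages = document.getDocumentCatalog().getAllPages();
                int pageNum = 0;
                for (Object p : pages) {
                    BufferedImage image = ((PDPage) p).convertToImage();
                    ImageIO.write(image, "png", new File(outDir, "page-" + (++pageNum) + ".png"));
                    // ... run Tesseract on the rendered page image here ...
                }
            }
        }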

        Tyler Palsulich added a comment -

        Are there any updates on this? I'm interested in getting OCR working. Are you building Tesseract from source, Luis Filipe Nassif?

        Chris A. Mattmann added a comment -

        Thanks Tyler Palsulich, I haven't worked on this in a while, but it's awesome and I'd love to get it into the sources. I would try building Tesseract from source, then try the latest patch and let me know what you see.

        Luis Filipe Nassif added a comment -

        Hi Tyler Palsulich,

        If you want to try the TesseractOCRParser.patch, you can simply install Tesseract. I implemented the OCRParser interface created by Grant Ingersoll in TIKA-93.patch, but it is not mandatory. You can skip TIKA-93.patch by simply changing the TesseractOCRParser class to extend AbstractParser instead of implementing OCRParser. To enable the new parser, you must list it in tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser and set a TesseractOCRConfig in the parseContext:

        TesseractOCRConfig config = new TesseractOCRConfig();
        //Needed if tesseract is not on system PATH
        config.setTesseractPath(tesseractFolder);
        parseContext.set(TesseractOCRConfig.class, config);
        

        So it will be enabled and automatically run on images, including embedded ones.

        Tyler Palsulich added a comment -

        Thanks for the help! I applied the patch. But, there are two copies of each class in the patch. Is that intentional? I just deleted one of the copies. I installed Tesseract and added the TesseractOCRParser to the META-INF file, but OCR isn't running on the example PDF. Should I be using an AutoDetectParser? I put my parse-only test below (I put testOCR.pdf in test-documents/ocr/ since it was messing with PDFParserTest.)

            @Test
            public void testPDFOCR() throws Exception {
                Parser parser = new AutoDetectParser();
                BodyContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
        
                TesseractOCRConfig config = new TesseractOCRConfig();   // Have Tesseract on my PATH.
                ParseContext parseContext = new ParseContext();
                parseContext.set(TesseractOCRConfig.class, config);
        
                InputStream stream = TesseractOCRTest.class.getResourceAsStream(
                        "/test-documents/ocr/testOCR.pdf");
        
                try {
                    parser.parse(stream, handler, metadata, parseContext);
                } finally {
                    stream.close();
                }
            }
        
        Luis Filipe Nassif added a comment -

        It was not intentional - the patch should have only one copy of each class. I will fix it, thank you. You can use an AutoDetectParser to automatically process the PDF. But you must tell Tika what parser it has to use to process embedded files (e.g. images). If you want to only run OCR on embedded images:

        parseContext.set(Parser.class, new TesseractOCRParser());
        

        If you want to process any kind of embedded file:

        parseContext.set(Parser.class, new AutoDetectParser());
        

        But by default, trunk currently does not extract images from PDF files, see TIKA-1294. Try to turn it on with this code:

        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        

        Let me know if this helps.

        Tyler Palsulich added a comment -

        Awesome! I attached another patch which includes TesseractOCRParser.patch with unit tests for the parser (PDF, PPTX, and DOCX files with embedded images with text). We could use more tests for images with no text, blurry text, and so on. But I don't know how good Tesseract is.

        Steps to apply this patch: install Tesseract [1], apply the patch, move the test files into tika-parsers/src/test/resources/test-documents/ocr. Run the tests with mvn test -Dtest=org.apache.tika.parser.ocr.TesseractOCRTest -DfailIfNoTests=false.

        What needs to happen from here? How should we include Tesseract in the sources? How should we handle timeouts (give the user a warning that OCR can be slow/timed out)?

        [1] - https://code.google.com/p/tesseract-ocr/wiki/ReadMe

        Luis Filipe Nassif added a comment -

        Thank you very much Tyler Palsulich for including unit tests! We could also include tests for normal images (not embedded).

        There is a simple timeout control that throws a TikaException with a specific message if it happens. The idea of forcing a TesseractOCRConfig object to be set in the ParseContext to run OCR is to not affect users who do not want OCR, exactly because it could take seconds, even minutes. So TesseractOCRParser can be included in the Tika parser list by default with no problem. We could also include a warning about OCR slowness in the class description.

        I have no idea how to include Tesseract in the sources. Maybe Tika committers can help with this?

        Tyler Palsulich added a comment -

        Minor updates to the patch: Moved the OCRParser to tika-parsers (unless others think it should be in tika-core?), moved the files from test-documents/ocr to just test-documents.
        In PDFParserTest, I added testOCR.pdf to the list of known metadataDiff, since the PDF version is different for the NonSeq and Seq PDFBox parsers.

        In tika-server TikaMimeTypesTest, I changed testGetJSON() – will someone look at this part? Something seems weird about it.

        There still needs to be a check for whether Tesseract is installed, and where. I looked a bit at the ExternalParser code – it seems useful, but I'm not sure how to combine TesseractOCRParser and ExternalParser. Can someone else chime in? At this point, I don't think we need more than a call to ExternalParser.check(). But I could be wrong.

        In my opinion, we should just require that Tesseract be on the user's path. It's an uncommon program. So, if a user installs it, it will probably be for Tika OCR. So, it's not a big deal for them to put it on their path.

        I put up a review: https://reviews.apache.org/r/22402/. I don't think this is ready yet, but I'd like to get it moving.
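
        A minimal sketch of the kind of availability check mentioned above, assuming Tesseract is expected on the PATH (how missing binaries are reported ultimately depends on ExternalParser.check's error handling):

        import org.apache.tika.parser.external.ExternalParser;

        public class TesseractCheck {
            // Returns true if a "tesseract" binary on the PATH can be executed at all;
            // a parser could advertise an empty supported-type set when this is false.
            public static boolean tesseractAvailable() {
                return ExternalParser.check("tesseract");
            }
        }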

        Luis Filipe Nassif added a comment -

        Hi Tyler Palsulich,

        I think the option to configure the Tesseract path is very useful. For example, I can distribute Tesseract binaries together with my app and do not need to change environment variables on the end user's OS.

        Chris A. Mattmann added a comment -

        Thanks Tyler, I will look at the review board now!


          People

          • Assignee:
            Chris A. Mattmann
            Reporter:
            Jukka Zitting
          • Votes:
            11
            Watchers:
            27

            Dates

            • Created:
              Updated:

              Development