Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.14
    • Component/s: None
    • Labels:
      None

      Description

      Users can now run OCR on individual images embedded inline in PDFs if they get the configuration right.

      There are some drawbacks: 1) the text appears as an attachment if using the RecursiveParserWrapper, and 2) text may be extracted more cleanly from the fully rendered page than from the individual images (this is still TBD).

      It might be useful to run OCR against each rendered page (instead of the component images).

      Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912). In the meantime, this change will allow us to experiment with strategies until that cleaner integration is available.

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Pushed to trunk. Still have to integrate with 2.x.

          I made some modifications to the TesseractOCRParser. Let me know if there are any concerns there.

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #1005 (See https://builds.apache.org/job/tika-trunk-jdk1.7/1005/)
          TIKA-1994 – integrate OCR with PDFParser (tallison: rev 7aeb95d6c7a6ac3611f2dd975baa73f566631061)

          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/OCR2XHTML.java
          • tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
          • tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          lfcnassif Luis Filipe Nassif added a comment - - edited

          Hi Tim,

          Until the deeper PDFBox integration arrives (good to know they are working on that!), I think the strategy of running OCR on each rendered page is very good. We currently use it in my organization instead of OCRing individual images inside a PDF. As you know, PDFs may have one image per paragraph, line, word, or even per character, and that can give poor results with the individual-image OCR approach.

          As a suggestion: we count the number of extracted text characters per page and only run OCR if the count is lower than a configurable value (we use 100 by default), because a low count suggests a high chance that the page consists of one big (scanned) image. That eliminates a lot of duplicate text that OCR would otherwise return and speeds up extraction considerably.
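
The per-page threshold check described above could be sketched roughly as follows. This is only an illustration, not Tika code: the class `OcrPageGate` and its methods are hypothetical names, and counting only non-whitespace characters is an assumption (the comment does not say how characters are counted). The default of 100 is the value mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Tika): decides per page whether OCR is worth running. */
final class OcrPageGate {
    /** Default threshold taken from the comment above. */
    static final int DEFAULT_MIN_CHARS = 100;

    /** Counts non-whitespace characters; ignoring whitespace is an assumption of this sketch. */
    static int countTextChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < pageText.length(); i++) {
            if (!Character.isWhitespace(pageText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /** True when the page yielded fewer than minChars characters of regular text. */
    static boolean shouldOcr(String pageText, int minChars) {
        return countTextChars(pageText) < minChars;
    }

    /** Returns the zero-based indices of the pages that fall below the threshold. */
    static List<Integer> pagesToOcr(List<String> pageTexts, int minChars) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (shouldOcr(pageTexts.get(i), minChars)) {
                result.add(i);
            }
        }
        return result;
    }
}
```

A caller would extract text for each page first, then render and OCR only the pages returned by `pagesToOcr(pageTexts, OcrPageGate.DEFAULT_MIN_CHARS)`.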

          tallison@mitre.org Tim Allison added a comment -

          Sounds like a great strategy. That will catch image-only (or image-mostly) pages. Let's open a separate issue to track other strategies beyond what I initially put in.

          "because it suggests a high chance that the page is formed by a big (scanned) image"

          Note that we process the page and then run OCR (if the strategy is ocr+text). We could gather info about the size/number of the images before making the determination.

          "speeds up the extraction a lot."

          Yes, I have to admit, I've been really impressed by the quality of Tesseract (on English, at least)...but the speed is an area of concern.

          I'm hoping to run "ocr_only" against some of our corpus over the weekend and compare that with "no_ocr." In addition to 'run ocr if there's only a little text', it would be neat to be able to run ocr if there is 'bad text' (TIKA-1443).

          Have you done any experiments on dpi setting/image format/image type on OCR performance? Does 200 dpi PNG GRAY do better than 200 dpi JPEG RGB...for example?
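
The idea raised above, gathering information about the size and number of images on a page before deciding whether to run OCR, might look something like the sketch below. Everything here is hypothetical: `ImageCoverage`, `Box`, and the `minFraction` cut-off are illustrative names and parameters, not Tika or PDFBox API, and a real implementation would still have to obtain the drawn image dimensions from the PDF content stream.

```java
import java.util.List;

/** Hypothetical sketch (not Tika or PDFBox API): guess whether a page is one big scanned image. */
final class ImageCoverage {
    /** Width and height of one image as drawn on the page, in the same units as the page size. */
    static final class Box {
        final double width;
        final double height;

        Box(double width, double height) {
            this.width = width;
            this.height = height;
        }

        double area() {
            return width * height;
        }
    }

    /** Fraction of the page area taken by the largest single image, clamped to [0, 1]. */
    static double largestImageFraction(double pageWidth, double pageHeight, List<Box> images) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0) {
            return 0.0;
        }
        double largest = 0.0;
        for (Box box : images) {
            largest = Math.max(largest, box.area());
        }
        return Math.min(1.0, largest / pageArea);
    }

    /** True when one image covers at least minFraction of the page (the cut-off is an assumption). */
    static boolean looksScanned(double pageWidth, double pageHeight, List<Box> images, double minFraction) {
        return largestImageFraction(pageWidth, pageHeight, images) >= minFraction;
    }
}
```

Using the largest single image rather than the summed area is a deliberate choice in this sketch: a page built from many tiny per-word or per-character images would otherwise look like a scan even though no single image covers it.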

          tallison@mitre.org Tim Allison added a comment -

          And one other question... I know that Tesseract can do script detection via the commandline. Is there any way at all to do language detection so that you can pass in the right language model?

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #1006 (See https://builds.apache.org/job/tika-trunk-jdk1.7/1006/)
          TIKA-1994 – integrate OCR with PDFParser, update CHANGES.txt (tallison: rev 1af1078adcb746fced8c71e4afe5b4d008a3f6b8)

          • CHANGES.txt
          tallison@mitre.org Tim Allison added a comment -

          Bob Paulin, in 2.0, if we keep the current setup, PDFParser will now have to depend on tika-parser-multimedia-module. Not too awful, but it's another intermodule dependency that I'd prefer not to add.

          I thought about moving the TesseractOCRParser into its own module, but it currently depends on the image parsers for metadata (thanks to my complaints). I think by the time 2.0 is ready, we'll get rid of that dependency and let the user choose to combine OCR+image metadata (once we can combine parsers)...so, down the road, I think it might make sense to break the OCR parser into its own module.

          Thoughts, obvious solutions?

          tallison@mitre.org Tim Allison added a comment -

          Initial fix is added. Let's open a new issue to track improvements to the strategies.

          hudson Hudson added a comment -

          FAILURE: Integrated in tika-2.x-windows #11 (See https://builds.apache.org/job/tika-2.x-windows/11/)
          TIKA-1994 – Integrate TesseractOCR with full page image rendering for (tallison: rev ebe70289815776f6ce6c271c7faf8d23cfd31337)

          • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • CHANGES.txt
          • tika-parser-modules/tika-parser-multimedia-module/pom.xml
          • tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
          • tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
          • tika-parser-modules/tika-parser-pdf-module/pom.xml
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/OCR2XHTML.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • tika-parser-bundles/tika-parser-journal-bundle/src/test/java/org/apache/tika/module/journal/BundleIT.java
          • tika-parser-bundles/tika-parser-pdf-bundle/src/test/java/org/apache/tika/module/pdf/BundleIT.java
          • tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-2.x #107 (See https://builds.apache.org/job/tika-2.x/107/)
          TIKA-1994 – Integrate TesseractOCR with full page image rendering for (tallison: rev ebe70289815776f6ce6c271c7faf8d23cfd31337)

          • tika-parser-bundles/tika-parser-pdf-bundle/src/test/java/org/apache/tika/module/pdf/BundleIT.java
          • tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/OCR2XHTML.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • tika-parser-modules/tika-parser-multimedia-module/pom.xml
          • tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
          • CHANGES.txt
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
          • tika-parser-bundles/tika-parser-journal-bundle/src/test/java/org/apache/tika/module/journal/BundleIT.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • tika-parser-modules/tika-parser-pdf-module/pom.xml
          • tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
          • tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          lfcnassif Luis Filipe Nassif added a comment -

          Using info like the number and size of images per page before making the decision would be great.

          Yes, I did some experiments a few years ago on these settings (150 vs. 200 vs. 300 dpi; black-and-white vs. grayscale vs. RGB). Tesseract suggests 300 dpi for 10-point fonts, but at that time I got very good results and speed with 200 dpi grayscale on my very limited corpus (Portuguese language, font sizes larger than 10 pt). PNG is better than JPEG: it is lossless, has less noise, and is recommended by Tesseract too.
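
To make the dpi and image-type settings above concrete, here is a small sketch using only the Java standard library. It is not how Tika renders pages; `OcrImagePrep` and its methods are hypothetical names. It relies on one fact from the PDF format: the default user-space unit is 1/72 inch, so a page dimension in points maps to `points * dpi / 72` pixels.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical helper (not Tika code): prepares a rendered page image for OCR. */
final class OcrImagePrep {
    /** Default PDF user space uses 72 units per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Pixel length of a page dimension given in points when rendered at the given dpi. */
    static int pixelsAt(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Copies the source image into an 8-bit grayscale image of the same size. */
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }

    /** Encodes the image as PNG (lossless) and returns the encoded bytes. */
    static byte[] toPng(BufferedImage img) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(img, "png", out)) {
                throw new IOException("no PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

For example, a US Letter page (612 x 792 points) comes out at 1700 x 2200 pixels at 200 dpi, versus 2550 x 3300 at 300 dpi, which is 2.25 times as many pixels for Tesseract to process.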

          lfcnassif Luis Filipe Nassif added a comment -

          Hmm, I don't know; I have never tried that.

          tallison@mitre.org Tim Allison added a comment -

          Great. Default is 200dpi gray png. I guessed right. Thank you!

          I just kicked off the ocr_only run against ~300k pdfs in our corpus...looks like it might take a few days to complete. That'll give us a baseline at least.


            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              tallison@mitre.org Tim Allison
            • Votes:
              0
              Watchers:
              5