Tika / TIKA-2175

Enable extraction of inlined jp2/jpx from PDF

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels:
      None

      Description

      On TIKA-2174, Matthew Caruana Galizia reported that inline jp2 images in PDFs were not being OCR'd. TIKA-2174 added that file type to our Tesseract parser, but our code in the PDFParser wasn't extracting those inline images in the first place. Let's fix that.

      1. pdf-with-jp2-images.pdf
        56 kB
        Matthew Caruana Galizia

        Activity

        tallison@mitre.org Tim Allison added a comment -

        I think we're good now. Matthew Caruana Galizia, thank you again for opening this and other tickets. Let us know what else you find. Also, please reopen, of course, if this isn't actually fixed for you.

        tallison@mitre.org Tim Allison added a comment -

        LOL...TIKA-2169 wasn't the problem. I was bitten, yet again, by TIKA-2096.

                PDFParserConfig config = new PDFParserConfig();
                config.setExtractInlineImages(true);
                ParseContext pc = new ParseContext();
                pc.set(PDFParserConfig.class, config);
                pc.set(Parser.class, new AutoDetectParser()); //DO NOT FORGET!!!
                System.out.println(getXML("pdf-with-jp2-images.pdf", pc).xml);
        
        mcaruanagalizia Matthew Caruana Galizia added a comment -

        The problem was OpenCL support in Tesseract. Once I rebuilt Tesseract without OpenCL support, I got the same results as you above, but using setExtractInlineImages(true) instead of setOcrStrategy(...). Thank you for testing.

        tallison@mitre.org Tim Allison added a comment -

        Wait, ok, this option is not working:

                PDFParserConfig config = new PDFParserConfig();
                config.setExtractInlineImages(true);
                ParseContext context = new ParseContext();
                context.set(PDFParserConfig.class, config);
                System.out.println(getXML("pdf-with-jp2-images.pdf", context).xml);
        

        However, I am getting OCR content (with bad html tags!) with this:

                PDFParserConfig config = new PDFParserConfig();
                config.setExtractInlineImages(true);
                ParseContext context = new ParseContext();
                context.set(PDFParserConfig.class, config);
                debug(getRecursiveMetadata("pdf-with-jp2-images.pdf", context));
        

        I think this will be fixed once I get around to TIKA-2169.

        tallison@mitre.org Tim Allison added a comment -

        Hmmm... This is working for me (at least in our test suite):

            @Test
            public void testjp2() throws Exception {
                PDFParserConfig config = new PDFParserConfig();
                config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
                ParseContext context = new ParseContext();
                context.set(PDFParserConfig.class, config);
                System.out.println(getXML("pdf-with-jp2-images.pdf", context).xml);
            }
        
        

        yields:

        <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
        <meta name="date" content="2015-12-28T14:25:23Z" />
        <meta name="pdf:PDFVersion" content="1.7" />
        <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
        <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
        <meta name="X-Parsed-By" content="class org.apache.tika.parser.ocr.TesseractOCRParser" />
        <meta name="xmp:CreatorTool" content="Nitro Pro" />
        <meta name="access_permission:modify_annotations" content="true" />
        <meta name="access_permission:can_print_degraded" content="true" />
        <meta name="access_permission:extract_for_accessibility" content="true" />
        <meta name="access_permission:assemble_document" content="true" />
        <meta name="xmpTPg:NPages" content="2" />
        <meta name="Last-Modified" content="2015-12-28T14:25:23Z" />
        <meta name="dcterms:modified" content="2015-12-28T14:25:23Z" />
        <meta name="dc:format" content="application/pdf; version=1.7" />
        <meta name="access_permission:extract_content" content="true" />
        <meta name="Last-Save-Date" content="2015-12-28T14:25:23Z" />
        <meta name="access_permission:can_print" content="true" />
        <meta name="pdf:docinfo:creator_tool" content="Nitro Pro" />
        <meta name="access_permission:fill_in_form" content="true" />
        <meta name="pdf:docinfo:modified" content="2015-12-28T14:25:23Z" />
        <meta name="meta:save-date" content="2015-12-28T14:25:23Z" />
        <meta name="pdf:encrypted" content="false" />
        <meta name="modified" content="2015-12-28T14:25:23Z" />
        <meta name="access_permission:can_modify" content="true" />
        <meta name="Content-Type" content="application/pdf" />
        <title></title>
        </head>
        <body><div class="page"><p />
        <div class="ocr">r13.3mm] fie G’hile
        
        CERTIFICADO
        
        El Banco dc Chile. oficinn QUILLOTAA Confirm que cl Sr. Algiandm Rodrigo Pnlmn Perez.
        Rul: 9,582.807-8. cs lilular dc la Cucmu Curricula MIN asigmda con el N" 1404810008
        vigcnlc dcsdc nl ox dc Fuhrcm dc I991. Bien llevada.
        
        Dames lu pneseme confirmacion. a pcdidn dcl inlencsado sin ulterior responxubifldfld para el
        Banco (I: Chile.
        
           
         
        
        e
        Bani“ it I ma
        Fl“)
        ENR‘OUE ‘22:“:
        
        aumou
        
        Samiago. ()3 dc Oczubrc dc 21114.
        
        </div>
        </div>
        <div class="page"><p />
        <div class="ocr">W
        BANK OF CHILE
        
        CERTIFICATE
        
        The Bank ofChilc. office in QUILLOTAV hereby confirms that Mr. Alejandro Rodrigo Palma Pen; with
        Tax Payer Regisu'ation No. 9.582.807-8 is Ill: holder of: current mun! No. Ida—48200438, nclive since
        3"“ February I991 showing a sound performance.
        
        We issue this ocnificalicm a! the request oflhc inleteslcd puny and it emails any liabilily for Bank of
        Chile.
        
        (Signature illegible)
        For: Bank of Chile
        The seal of Bank ofC'hileV ENRIQUE MARFIL ILABACA has been slumped herein)
        
        Santiago. 3" October 20 Hr
        
         
        
         
        
         
        
         
        
         
        
        </div>
        </div>
        </body></html>
        tallison@mitre.org Tim Allison added a comment -

        Will take a look. Thank you for sharing a file.

        mcaruanagalizia Matthew Caruana Galizia added a comment -

        Still no joy, both with my bridge classes and with tika-app from trunk. It seems the images in the PDF are skipped over entirely. I don't think that the embedded document parsing handler is ever even invoked. I've attached the PDF in question. If you open it in a hex editor, you can see that the files are declared to be "jp2" format.
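
The hex-editor check described above can be approximated in code. The following is only a minimal sketch, assuming that a raw byte scan is good enough for a first look: it counts occurrences of the 12-byte JP2 signature box and of the PDF filter name /JPXDecode in a file. The class and method names are hypothetical and not part of Tika or PDFBox. A scan like this can miss images whose streams are wrapped in an additional filter, so a count of zero is not proof that no JPEG 2000 data is present.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or PDFBox.
final class Jp2SignatureScan {

    // The 12-byte signature box that opens a JP2-family file.
    static final byte[] JP2_SIGNATURE = {
        0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20,
        0x0D, 0x0A, (byte) 0x87, 0x0A
    };

    // Counts every position in data at which pattern occurs (naive scan).
    static int countOccurrences(byte[] data, byte[] pattern) {
        if (pattern.length == 0) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i + pattern.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF to inspect.
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        byte[] filterName = "/JPXDecode".getBytes(StandardCharsets.US_ASCII);
        System.out.println("JP2 signature boxes: " + countOccurrences(pdf, JP2_SIGNATURE));
        System.out.println("/JPXDecode names:    " + countOccurrences(pdf, filterName));
    }
}
```

Run it with the PDF path as the only argument, for example `java Jp2SignatureScan pdf-with-jp2-images.pdf`.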

        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build Tika-trunk #1140 (See https://builds.apache.org/job/Tika-trunk/1140/)
        TIKA-2174/TIKA-2175 – clean up (tallison: rev b97045aea303bac75bd3c937cde6b42c7a3b3c48)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
        tallison@mitre.org Tim Allison added a comment -

        Hmmm. OCR via inline images works on this file from PDFBOX-1067. Let me know what you find on your files.

        tallison@mitre.org Tim Allison added a comment - - edited

        After I made the change recommended by Tilman Hausherr, I'm able to run OCR via extraction of the inline jpx on this file without the bridge classes.

        Can you try trunk against your test file(s)? My "test" with the linked file renamed to testOCR_jp2.pdf looked like this:

            @Test
            public void testOneOff() throws Exception {
                ParseContext context = new ParseContext();
                PDFParserConfig parserConfig = new PDFParserConfig();
                parserConfig.setExtractInlineImages(true);
                context.set(PDFParserConfig.class, parserConfig);
                debug(getRecursiveMetadata("testOCR_jp2.pdf", context));
            }
        

        Or, are the bridge classes necessary for JBIG2, but jpx works ok for you?

        mcaruanagalizia Matthew Caruana Galizia added a comment -

        I've filed an issue with the jpeg2000 imageio project to declare jpx support. The decoders/encoders support that format; the issue is simply that it isn't declared, so PDFBox doesn't find them.

        As a temporary workaround and proof of concept I've added these two bridge Spi classes: https://github.com/ICIJ/extract/tree/master/src/main/java/org/icij/imageio/jpx
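
Whether a given format name is declared to Java's ImageIO registry can be checked with the standard `javax.imageio` API alone. This is a rough sketch, not the bridge classes linked above: the class name is hypothetical, and the format names queried in `main` are illustrative assumptions, since the thread does not state which name PDFBox actually looks up.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper, not part of Tika, PDFBox, or the linked bridge classes.
final class ImageIoFormatCheck {

    // True if at least one registered ImageReader claims the given format name.
    static boolean hasReaderFor(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Pick up any ImageIO plugins present on the classpath.
        ImageIO.scanForPlugins();
        // Illustrative names only; adjust to whatever the caller queries.
        for (String name : new String[] {"png", "jpeg2000", "jp2", "jpx"}) {
            System.out.println(name + " reader registered: " + hasReaderFor(name));
        }
    }
}
```

Running this once with and once without a JPEG 2000 plugin on the classpath shows which names that plugin declares, which is the gap the bridge classes are described as filling.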

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1138 (See https://builds.apache.org/job/Tika-trunk/1138/)
        TIKA-2175 – add extraction for inline jp2/jpx from PDFParser (tallison: rev 91cdce43d22cd6726375a83c7842fa299035a258)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
        • (edit) tika-parsers/pom.xml
        tallison@mitre.org Tim Allison added a comment -

        Fixed this in trunk. Will fix in 2.x tomorrow. Thank you, Tilman Hausherr, for the solution on the PDFBox users list!


          People

          • Assignee: Unassigned
          • Reporter: tallison@mitre.org Tim Allison
          • Votes: 0
          • Watchers: 3