Tika / TIKA-1742

StackOverflowError parsing a PDF with ExtractInlineImages=true

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      Here's the file:
      http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

      Code to repro (ExtractInlineImages must be true):

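          import java.io.BufferedInputStream;
          import java.io.FileInputStream;
          import java.io.InputStream;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.Parser;
          import org.apache.tika.parser.pdf.PDFParser;
          import org.apache.tika.parser.pdf.PDFParserConfig;
          import org.xml.sax.ContentHandler;
          import org.xml.sax.helpers.DefaultHandler;

          // Wrapper class name is arbitrary; pass the path to the PDF as args[0].
          public class Tika1742Repro {
            public static void main(String[] args) throws Exception {
              Parser parser = new PDFParser();
              Metadata metadata = new Metadata();
              ParseContext context = new ParseContext();
              PDFParserConfig config = new PDFParserConfig();
              ContentHandler handler = new DefaultHandler();

              // Inline image extraction must be on to trigger the error.
              config.setExtractInlineImages(true);
              config.setExtractUniqueInlineImagesOnly(false);

              context.set(PDFParserConfig.class, config);
              context.set(Parser.class, parser);

              // try-with-resources closes the stream even when parsing fails.
              try (InputStream is = new BufferedInputStream(new FileInputStream(args[0]))) {
                parser.parse(is, handler, metadata, context);
              }
            }
          }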
      

      Error (infinite recursion in extractImages):

      Exception in thread "main" java.lang.StackOverflowError
      	at java.util.LinkedHashMap$Entry.addBefore(LinkedHashMap.java:340)
      	at java.util.LinkedHashMap$Entry.access$600(LinkedHashMap.java:320)
      	at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:444)
      	at java.util.HashMap.addEntry(HashMap.java:888)
      	at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
      	at java.util.HashMap.put(HashMap.java:509)
      	at org.apache.pdfbox.cos.COSDictionary.setItem(COSDictionary.java:246)
      	at org.apache.pdfbox.pdmodel.common.COSDictionaryMap.convert(COSDictionaryMap.java:206)
      	at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:331)
      	at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:310)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      

          Activity

          tallison@mitre.org Tim Allison added a comment -

          The HORROR! If it were a second-rate conference, it would be one thing... (not sarcasm, I'm an NLP'er).

          I confirmed that this same bug happens in pure PDFBox 1.8.11-SNAPSHOT's ExtractImages. I opened PDFBOX-2988.

          tallison@mitre.org Tim Allison added a comment -

          Tilman Hausherr fixed this over in PDFBox 1.8.x (it was already not an issue in PDFBox trunk). Any problems if I copy Tilman Hausherr's strategy and only export one copy of an image per page?
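
For illustration, the "one copy of an image per page" idea can be sketched as a per-page set of already-visited objects that the recursive walk consults before descending. This is only a minimal sketch under stated assumptions, not the actual Tika or PDFBox code: the `XObject` class and the `extractImages`/`walk` helpers below are hypothetical stand-ins, and the cyclic structure in `main` is an assumed example of resources that refer back to themselves.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UniqueImageWalk {

    /** Hypothetical stand-in for a PDF XObject: an image, or a form holding nested XObjects. */
    static final class XObject {
        final String name;
        final boolean isImage;
        final Map<String, XObject> children = new LinkedHashMap<>();

        XObject(String name, boolean isImage) {
            this.name = name;
            this.isImage = isImage;
        }
    }

    /** Collects image names reachable from one page, visiting each object at most once. */
    static List<String> extractImages(Map<String, XObject> pageXObjects) {
        // Identity-based set: two references to the same object count as one visit.
        Set<XObject> seen = Collections.newSetFromMap(new IdentityHashMap<XObject, Boolean>());
        List<String> images = new ArrayList<>();
        walk(pageXObjects, seen, images);
        return images;
    }

    private static void walk(Map<String, XObject> xObjects, Set<XObject> seen, List<String> out) {
        for (XObject x : xObjects.values()) {
            if (!seen.add(x)) {
                continue; // already handled on this page; skipping also breaks reference cycles
            }
            if (x.isImage) {
                out.add(x.name);
            } else {
                walk(x.children, seen, out);
            }
        }
    }

    public static void main(String[] args) {
        XObject image = new XObject("Im1", true);
        XObject form = new XObject("Fm1", false);
        form.children.put("Im1", image);
        form.children.put("Fm1", form); // self-reference: unbounded recursion without the guard

        Map<String, XObject> page = new LinkedHashMap<>();
        page.put("Fm1", form);
        page.put("Im1", image);

        System.out.println(extractImages(page)); // prints [Im1]
    }
}
```

The trade-off in this sketch is that an image referenced several times on the same page is exported only once, which is the behavior change the question above asks about.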

          tallison@mitre.org Tim Allison added a comment -

          r1706086

          tallison@mitre.org Tim Allison added a comment -

          Thank you, Nate Dire, for raising this, and thank you Tilman Hausherr for a solution!

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #860 (See https://builds.apache.org/job/tika-trunk-jdk1.7/860/)

          clean up from TIKA-1742 and TIKA-1748 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1706092)

          • trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java

          TIKA-1742 prevent infinite recursion while processing inline images in PDFs by limiting extraction to unique images per page...following Tilman Hausherr's solution on PDFBox (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1706086)

          • trunk/CHANGES.txt
          • trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
          • trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java

            People

            • Assignee: Unassigned
            • Reporter: Nate Dire (nated)
            • Votes: 0
            • Watchers: 4
