Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 1.10
- None
- None
Description
Here's the file:
http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf
Code to reproduce (extractInlineImages must be set to true in PDFParserConfig):
Parser parser = new PDFParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParserConfig config = new PDFParserConfig();
ContentHandler handler = new DefaultHandler();
config.setExtractInlineImages(true);
config.setExtractUniqueInlineImagesOnly(false);
context.set(PDFParserConfig.class, config);
context.set(Parser.class, parser);
InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
try {
    parser.parse(is, handler, metadata, context);
} finally {
    is.close();
}
Error (infinite recursion in extractImages):
Exception in thread "main" java.lang.StackOverflowError
    at java.util.LinkedHashMap$Entry.addBefore(LinkedHashMap.java:340)
    at java.util.LinkedHashMap$Entry.access$600(LinkedHashMap.java:320)
    at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:444)
    at java.util.HashMap.addEntry(HashMap.java:888)
    at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
    at java.util.HashMap.put(HashMap.java:509)
    at org.apache.pdfbox.cos.COSDictionary.setItem(COSDictionary.java:246)
    at org.apache.pdfbox.pdmodel.common.COSDictionaryMap.convert(COSDictionaryMap.java:206)
    at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:331)
    at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:310)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
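The repeated extractImages frames in the trace above suggest that the resources being walked refer back to themselves, directly or through a nested object. The following is only a minimal sketch of one common way to guard a recursive walk against such cycles: remember every node already visited, by identity, and skip it on a second encounter. It is not the actual Tika or PDFBox fix. ResourceNode, CycleSafeImageWalk and collectImages are hypothetical names invented for this illustration and stand in for the real PDF resource objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a PDF resources dictionary.
// This is NOT a PDFBox or Tika class; it exists only for this sketch.
class ResourceNode {
    final String name;
    final List<String> images = new ArrayList<String>();
    final List<ResourceNode> children = new ArrayList<ResourceNode>();

    ResourceNode(String name) {
        this.name = name;
    }
}

class CycleSafeImageWalk {

    // Collects the image names reachable from root, visiting each node at most once.
    static List<String> collectImages(ResourceNode root) {
        // Identity-based set: two nodes are "the same" only if they are the same object.
        Set<ResourceNode> seen =
                Collections.newSetFromMap(new IdentityHashMap<ResourceNode, Boolean>());
        List<String> out = new ArrayList<String>();
        walk(root, seen, out);
        return out;
    }

    private static void walk(ResourceNode node, Set<ResourceNode> seen, List<String> out) {
        // Set.add returns false when the node was already visited, which breaks the cycle.
        if (node == null || !seen.add(node)) {
            return;
        }
        out.addAll(node.images);
        for (ResourceNode child : node.children) {
            walk(child, seen, out);
        }
    }

    public static void main(String[] args) {
        ResourceNode page = new ResourceNode("page");
        ResourceNode form = new ResourceNode("form");
        page.images.add("Im0");
        form.images.add("Im1");
        page.children.add(form);
        form.children.add(page); // self-referencing structure, as assumed for the failing PDF
        System.out.println(collectImages(page)); // prints [Im0, Im1]
    }
}
```

Without the seen check, the walk in main would recurse between page and form until the stack overflowed, which matches the shape of the reported error. The guard only addresses cycles; a very deep but acyclic structure could still exhaust the stack and would need an iterative walk instead.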
Issue Links
- depends upon: PDFBOX-2988 Infinite recursion in ExtractImages 1.8.11-SNAPSHOT (Closed)
- relates to: TIKA-1038 Parsing PDF with StackOverlowError (Resolved)