Tika / TIKA-1742

StackOverflowError parsing a PDF with ExtractInlineImages=true

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      Here's the file:
      http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf

      Code to repro (ExtractInlineImages must be true):

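          import java.io.BufferedInputStream;
          import java.io.FileInputStream;
          import java.io.InputStream;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.Parser;
          import org.apache.tika.parser.pdf.PDFParser;
          import org.apache.tika.parser.pdf.PDFParserConfig;
          import org.xml.sax.ContentHandler;
          import org.xml.sax.helpers.DefaultHandler;

          // Class name is illustrative; pass the path to the PDF as args[0].
          public class Tika1742Repro {
            public static void main(String[] args) throws Exception {
              Parser parser = new PDFParser();
              Metadata metadata = new Metadata();
              ParseContext context = new ParseContext();
              PDFParserConfig config = new PDFParserConfig();
              ContentHandler handler = new DefaultHandler();

              // The overflow only occurs with inline image extraction enabled.
              config.setExtractInlineImages(true);
              config.setExtractUniqueInlineImagesOnly(false);

              context.set(PDFParserConfig.class, config);
              context.set(Parser.class, parser);

              InputStream is = new BufferedInputStream(new FileInputStream(args[0]));
              try {
                parser.parse(is, handler, metadata, context);
              } finally {
                is.close();
              }
            }
          }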
      

      Error (infinite recursion in extractImages):

      Exception in thread "main" java.lang.StackOverflowError
      	at java.util.LinkedHashMap$Entry.addBefore(LinkedHashMap.java:340)
      	at java.util.LinkedHashMap$Entry.access$600(LinkedHashMap.java:320)
      	at java.util.LinkedHashMap.createEntry(LinkedHashMap.java:444)
      	at java.util.HashMap.addEntry(HashMap.java:888)
      	at java.util.LinkedHashMap.addEntry(LinkedHashMap.java:427)
      	at java.util.HashMap.put(HashMap.java:509)
      	at org.apache.pdfbox.cos.COSDictionary.setItem(COSDictionary.java:246)
      	at org.apache.pdfbox.pdmodel.common.COSDictionaryMap.convert(COSDictionaryMap.java:206)
      	at org.apache.pdfbox.pdmodel.PDResources.setXObjects(PDResources.java:331)
      	at org.apache.pdfbox.pdmodel.PDResources.getXObjects(PDResources.java:269)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:310)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
      	at org.apache.tika.parser.pdf.PDF2XHTML.extractImages(PDF2XHTML.java:319)
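
      The repeated extractImages frames above suggest a traversal that keeps
      re-entering resources it has already visited. One plausible cause,
      assumed here and not confirmed by the report, is a resource dictionary
      that refers back to one of its ancestors. The sketch below illustrates
      the general guard against that: track visited nodes by identity and stop
      on a repeat. The Resources class and all method names are hypothetical
      stand-ins, not PDFBox or Tika types, and this is not presented as the
      fix that was actually committed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class CycleSafeImageWalk {

    /** Hypothetical stand-in for a PDF resources dictionary (not a PDFBox type). */
    static final class Resources {
        final String name;
        final List<String> images = new ArrayList<String>();
        final List<Resources> forms = new ArrayList<Resources>();

        Resources(String name) {
            this.name = name;
        }
    }

    /** Collects image names, visiting each Resources node at most once. */
    static List<String> extractImages(Resources root) {
        List<String> found = new ArrayList<String>();
        // Identity-based set: two nodes count as the same only if they are the same object.
        Set<Resources> seen =
                Collections.newSetFromMap(new IdentityHashMap<Resources, Boolean>());
        walk(root, seen, found);
        return found;
    }

    private static void walk(Resources res, Set<Resources> seen, List<String> found) {
        if (res == null || !seen.add(res)) {
            return; // null or already visited: stop instead of recursing again
        }
        found.addAll(res.images);
        for (Resources child : res.forms) {
            walk(child, seen, found);
        }
    }

    public static void main(String[] args) {
        Resources page = new Resources("page");
        Resources form = new Resources("form");
        page.images.add("Im0");
        form.images.add("Im1");
        page.forms.add(form);
        form.forms.add(page); // deliberate cycle: form points back to page

        System.out.println(extractImages(page)); // prints [Im0, Im1]
    }
}
```

      Without the seen check, the same input recurses until the stack
      overflows, which matches the shape of the trace above. An identity set
      is used so that two distinct dictionaries with equal contents are still
      both visited.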
      

    People

      Assignee: Unassigned
      Reporter: Nate Dire (nated)
      Votes: 0
      Watchers: 4
