PDFBox / PDFBOX-4201

Certain scanned pdfs do not render

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.8
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels: None

    Description

      I am using PDFBox version 2.0.8. I am trying to render scanned pdfs, but some of them fail to render and throw an error. Native pdfs render without trouble, and so do most of my scanned pdfs; only a couple fail (one is attached).

      This is the code I used to render the pdf.

      try (PDDocument document = load(file)) {
          logger.debug("start generate image file " + pageNumber + " for " + name);
          PDFRenderer pdfRenderer = new PDFRenderer(document);
          return getPageImage(pdfRenderer, pageNumber, name, storageId);
      }

      The call to getPageImage above runs the following code:

      File imageFile = File.createTempFile(StringUtils.toFilename(storageId) + "_" + pageNumber, ".png");
      imageFile.deleteOnExit();
      
      final BufferedImage image = pdfRenderer.renderImageWithDPI(pageNumber - 1, dpi, ImageType.RGB);
      ImageIO.write(image, "png", imageFile);
      
      logger.debug("completed generate image file " + pageNumber + " for " + name);
      return imageFile;

      The issue occurs in the second code snippet in the line

      final BufferedImage image = pdfRenderer.renderImageWithDPI(pageNumber - 1, dpi, ImageType.RGB);

       

      The stack trace is the following

      Caused by: java.io.IOException: Error: Expected operator 'ID' actual='In'
          at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:305) ~[pdfbox-2.0.8.jar:2.0.8]
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502) ~[pdfbox-2.0.8.jar:2.0.8]
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469) ~[pdfbox-2.0.8.jar:2.0.8]
          at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) ~[pdfbox-2.0.8.jar:2.0.8]
          at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:203) ~[pdfbox-2.0.8.jar:2.0.8]
          at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:145) ~[pdfbox-2.0.8.jar:2.0.8]
          at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:94) ~[pdfbox-2.0.8.jar:2.0.8]
          at com.sustain.document.PdfPageGenerator.getPageImage(PdfPageGenerator.java:70) ~[classes/:?]
          at com.sustain.document.PdfPageGenerator.getPageImage(PdfPageGenerator.java:59) ~[classes/:?]
      

      Because native pdfs rendered without problems, I initially assumed the issue affected only scanned pdfs. But since other scanned pdfs render fine, I am unsure what causes some to render and others to fail.
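
      One way to narrow this down might be to run the render step page by page and record which pages fail, rather than stopping at the first exception. The following is only a minimal sketch under assumptions: PageFailureProbe, PageAction, probe and rootMessage are hypothetical names made up for illustration and are not part of PDFBox. The demo in main simulates a failure on the second page with the message from the stack trace above.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: runs an action per page and collects failures.
final class PageFailureProbe {

    /** Hypothetical callback; takes a 0-based page index and may throw IOException. */
    @FunctionalInterface
    interface PageAction {
        void run(int pageIndex) throws IOException;
    }

    /** Runs the action for every page index and records failures instead of aborting. */
    static Map<Integer, String> probe(int pageCount, PageAction action) {
        Map<Integer, String> failures = new LinkedHashMap<>();
        for (int i = 0; i < pageCount; i++) {
            try {
                action.run(i);
            } catch (IOException e) {
                // Key is the 1-based page number, matching pageNumber in the report.
                failures.put(i + 1, rootMessage(e));
            }
        }
        return failures;
    }

    /** Follows the cause chain and returns the message of the innermost exception. */
    static String rootMessage(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return String.valueOf(current.getMessage());
    }

    public static void main(String[] args) {
        // Simulated three-page document in which the second page fails.
        Map<Integer, String> failures = probe(3, i -> {
            if (i == 1) {
                throw new IOException("Error: Expected operator 'ID' actual='In'");
            }
        });
        failures.forEach((page, msg) -> System.out.println("page " + page + ": " + msg));
    }
}
```

      In the code above, the action would presumably wrap the existing call, for example i -> pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB), with document.getNumberOfPages() as the page count. The resulting map would show whether the error is tied to specific pages of the attached file.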

      Attachments

        1. PDFBOX-4201-content-stream.txt
          43 kB
          Tilman Hausherr
        2. testDoc2_unc.pdf
          10.88 MB
          Tilman Hausherr
        3. testDoc2_unc-saved.pdf
          767 kB
          Tilman Hausherr
        4. testDoc2.pdf
          906 kB
          Antonio Contreras

          People

            Assignee: Unassigned
            Reporter: Antonio Contreras (tony_jtech)