PDFBox / PDFBOX-1202

org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Mac OS X 10.7.2

    Description

      Error "org.apache.pdfbox.filter.FlateFilter decode SEVERE: Stop reading corrupt stream" thrown when extracting text.

      The error is thrown at:

      • page 397 if the page loop starts at zero – for (int i = 0; i < allPages.size(); i++)
      • page 790 if the loop starts at 395 (that would make it approx. 397 pages from the beginning of the loop)
      • page 848 if the loop starts at 450 (that would make it approx. 397 pages from the beginning of the loop)

      The error is not thrown if:

      • the loop starts at page 452 or later
      • the loop starts at 0 and ends before 396
      • the loop starts at 200 and ends before 595

      Therefore I suspect that a loop spanning more than 396 pages will throw an error. Is that an indication of a memory leak of some sort?
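
      To make the pattern concrete, the distance between the loop's start index and the page where the error appears can be computed for the three failing runs listed above. This is only a minimal sketch of that arithmetic: `PageWindowCheck` and `pagesUntilFailure` are hypothetical names introduced for illustration and are not part of PDFBox.

```java
// Hypothetical helper (not PDFBox API): restates the failing runs as offsets.
class PageWindowCheck {

    // Distance between the loop's start index and the page where the error appeared.
    static int pagesUntilFailure(int startPage, int failingPage) {
        return failingPage - startPage;
    }

    public static void main(String[] args) {
        // {loop start, page at which the error was reported}, taken from the lists above.
        int[][] failingRuns = { {0, 397}, {395, 790}, {450, 848} };
        for (int[] run : failingRuns) {
            System.out.println("start " + run[0] + ", error at " + run[1]
                    + ", offset " + pagesUntilFailure(run[0], run[1]));
        }
    }
}
```

      The offsets come out as 397, 395 and 398, which is why the failure looks tied to the number of pages processed rather than to any particular page.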

      Full code is below. Note that the result is the same when instantiating a single PDFTextStripperByArea outside the page loop and invoking resetEngine() on the stripper inside the page loop.

      import java.awt.geom.Rectangle2D;
      import java.io.IOException;
      import java.util.List;

      import org.apache.pdfbox.exceptions.COSVisitorException;
      import org.apache.pdfbox.exceptions.CryptographyException;
      import org.apache.pdfbox.exceptions.InvalidPasswordException;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.pdmodel.PDPage;
      import org.apache.pdfbox.util.PDFTextStripperByArea;

      public class Main {

          public static void main(String[] args) throws IOException,
                  COSVisitorException, CryptographyException {

              PDDocument document = null;
              try {
                  document = PDDocument.load("/Users/ilijapavlic/Desktop/IATAUnitedStates.pdf");
                  if (document.isEncrypted()) {
                      try {
                          // Try the empty user password.
                          document.decrypt("");
                      } catch (InvalidPasswordException e) {
                          System.err.println("Error: Document is encrypted with a password.");
                          System.exit(1);
                      }
                  }

                  // Extraction region; odd-indexed pages are shifted by evenOffset.
                  float x = 55f;
                  float y = 40f;
                  float width = 168.5f;
                  float height = 689f;
                  float evenOffset = -10f;

                  List allPages = document.getDocumentCatalog().getAllPages();

                  for (int i = 0; i < allPages.size(); i++) {
                      System.out.println("Page " + i);

                      PDPage page = (PDPage) allPages.get(i);
                      // A new stripper is created for every page.
                      PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                      stripper.setSortByPosition(true);

                      for (int j = 0; j < 3; j++) {
                          if (i % 2 == 0) {
                              Rectangle2D.Float region = new Rectangle2D.Float(x, y, width*3, height);
                              stripper.addRegion("region", region);
                          } else {
                              Rectangle2D.Float region = new Rectangle2D.Float(x + evenOffset, y, width*3, height);
                              stripper.addRegion("region", region);
                          }
                      }

                      stripper.extractRegions(page);

                      // The extracted text is discarded; only the extraction itself matters here.
                      for (String regionName : stripper.getRegions()) {
                          stripper.getTextForRegion(regionName);
                      }
                  }
              } catch (Exception e) {
                  e.printStackTrace();
              } finally {
                  if (document != null) {
                      document.close();
                  }
              }
          }
      }

      Attachments

        1. IATAUnitedStates.pdf
          4.86 MB
          Ilija Pavlic

          People

            Assignee: Andreas Lehmkühler (lehmi)
            Reporter: Ilija Pavlic (ipavlic)