  1. PDFBox
  2. PDFBOX-3284

Big Pdf parsing to text - Out of memory

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.10, 1.8.11, 2.0.0, 3.0.0 PDFBox
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: Parsing
    • Labels: None

    Description

      I'm trying to parse a fairly large PDF (26MB) and convert it to text, but memory consumption is so high that it ends in an out-of-memory error. Running my test with -Xmx768M always fails; I have to raise the limit to 1500M to make it work.
      The resulting text is only 3MB, so I don't understand why it takes so much memory.
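
To put numbers on the memory use described above, the JVM's own management beans can report the heap limit and the recorded peak usage around a task. The following is only a minimal sketch using the standard java.lang.management API. HeapProbe and its method names are hypothetical and not part of PDFBox, and the figure it returns is an approximation, because the JVM samples pool usage rather than tracking every allocation.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not part of PDFBox) that reports JVM heap figures around a task. */
final class HeapProbe {

    private HeapProbe() {}

    /** Upper heap limit in bytes, which is what -Xmx controls. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Clears the recorded peaks so the next reading covers only the work that follows. */
    static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    /**
     * Sums the peak usage recorded for each heap pool since the last reset.
     * The pools may peak at different moments, so treat the sum as an upper estimate.
     */
    static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache and other non-heap pools
            }
            MemoryUsage peak = pool.getPeakUsage();
            if (peak != null) { // null means the pool is no longer valid
                total += peak.getUsed();
            }
        }
        return total;
    }

    /** Runs the task and returns the peak heap usage, in bytes, recorded while it ran. */
    static long measurePeak(Runnable task) {
        resetPeaks();
        task.run();
        return peakHeapBytes();
    }
}
```

Wrapping the text extraction in a Runnable and passing it to HeapProbe.measurePeak(...) would give a figure to compare against HeapProbe.maxHeapBytes(), which is one way to check how close a run comes to the -Xmx limit.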

      I've tested this code against 1.8.10, 1.8.11 and 2.0.0 with the same result.

      The pdf can be found here

      My code:

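      Test.java
      // Imports for PDFBox 2.0.x; in 1.8.x PDFTextStripper lives in org.apache.pdfbox.util.
      import java.io.File;
      import java.io.FileInputStream;
      import java.io.FileWriter;
      import java.io.InputStream;
      import org.apache.pdfbox.pdmodel.PDDocument;
      import org.apache.pdfbox.text.PDFTextStripper;
      import org.junit.Test;

      @Test
      public void testParsePdf_Content_Memory() throws Exception {
          // try-with-resources closes the stream, the document and the writer, even on failure
          try (InputStream inputStream = new FileInputStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
               PDDocument document = PDDocument.load(inputStream);
               FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"))) {
              PDFTextStripper pdfTextStripper = new PDFTextStripper();
              // Write the extracted text straight to the output file
              pdfTextStripper.writeText(document, fileWriter);
          }
      }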
      

      Attachments

        1. massparse-stat.txt
          331 kB
          Tilman Hausherr

            People

              Assignee: Unassigned
              Reporter: Nicolas Daniels (multanis)
              Votes: 3
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: