PDFBox / PDFBOX-911

Method PDDocument.getNumberOfPages() returns wrong number of pages

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 1.8.1
    • Component/s: None
    • Labels:
      None
    • Environment:
      Windows XP, Eclipse 3.5.2 Galileo

      Description

      Hello,

      I use PDFBox and am very pleased with it.
      For one PDF file however getNumberOfPages() returns the wrong number of pages (1 instead of 2 pages).

      Test code:
      File xx = new File("c:\\temp\\test.pdf");
      PDDocument pdoc = PDDocument.load(xx);
      int x = pdoc.getNumberOfPages();

      The PDF file could be provided.

      Thanks in advance.

      Regards

      1. test.unc.pdf
        202 kB
        Adam Nichols
      2. ASF.LICENSE.NOT.GRANTED--Martijn Brinkers.jpg
        205 kB
        nielsen
      3. ASF.LICENSE.NOT.GRANTED--atest.pdf
        151 kB
        nielsen
      4. ASF.LICENSE.NOT.GRANTED--test.pdf
        151 kB
        nielsen

          Activity

          Andreas Lehmkühler added a comment -

          Works fine with the non-sequential parser, at least since 1.8.1.

          Thanks for the report.

          Andreas Lehmkühler added a comment -

          The current parser stops reading when it reaches a startxref/%%EOF combination. In case of an updated pdf the content will be more or less concatenated and some of it will be after that combination.
          As Adam already mentioned we need a conforming parser, which starts at the end of the pdf, to solve this issue.
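
To illustrate the point about updated files, here is a minimal, hypothetical Java sketch (not PDFBox's actual parser). It assumes the whole file fits in memory and only compares the offset announced by the first startxref keyword with the one announced by the last. The class and method names are made up for illustration. A real conforming reader would additionally follow the /Prev entries in the trailers to pick up the older cross-reference sections.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: first vs. last startxref in an incrementally updated PDF. */
class XrefLocator {

    private static final String MARKER = "startxref";

    /** Offset announced by the FIRST startxref (what a forward scan that stops early sees). */
    static long firstStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        return parseOffset(text, text.indexOf(MARKER));
    }

    /** Offset announced by the LAST startxref (where a reader working from the end begins). */
    static long lastStartXref(byte[] pdf) {
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        return parseOffset(text, text.lastIndexOf(MARKER));
    }

    /** Parses the decimal number that follows the keyword; returns -1 if it is missing. */
    private static long parseOffset(String text, int markerPos) {
        if (markerPos < 0) {
            return -1;
        }
        int i = markerPos + MARKER.length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }
}
```

For a file that was saved once and then updated, the two methods return different offsets. A parser that trusts only the first one never sees the objects appended by the update.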

          Adam Nichols added a comment -

          The problem here is related to PDFBOX-796. I wrote a program which loads a PDF, loops through each of the object streams and dereferences them & adds them to the PDF, and then removes the stream. This makes PDFs much easier to read and debug without changing any actual data. It's merely written in plain text instead of a compressed stream. However, if there are multiple objects with the same object ID and revision number (which is a violation of the PDF spec, but happens quite often in reality), my decompression program just overwrites the old one with the one from the object stream. This is not how the parser handles these situations, and thus why there's a problem with the original, but no problem with the uncompressed version. This is also where PDFBOX-796 comes into play as it explains that some files can not be processed if the first object is overwritten. This issue proves that there are cases where the opposite is true. The solution, as Andreas pointed out a few months ago, is to support incremental updates. It just so happens that PDFBOX-912 has some support for incremental updates, and I've been thinking about taking a stab at changing the parser to conform with the PDF spec (i.e. start at the end of the document and work backwards). I'll take a look at the patches contributed by Thomas in PDFBOX-912 and see what I can do about this issue.
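
The two conflicting strategies for duplicate object IDs can be sketched as follows. This is a hypothetical model, not PDFBox code: ObjectTable, Definition and the lastWins flag are names invented for illustration, and the object bodies in the usage note are made-up values rather than the contents of test.pdf.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: resolving several definitions of the same object number and generation. */
class ObjectTable {

    /** One object definition as encountered while reading the file front to back. */
    static final class Definition {
        final int number;
        final int generation;
        final String body;

        Definition(int number, int generation, String body) {
            this.number = number;
            this.generation = generation;
            this.body = body;
        }
    }

    /**
     * Builds a table keyed by "number generation".
     * lastWins = true  : a later definition replaces an earlier one (incremental-update semantics).
     * lastWins = false : the first definition is kept and later ones are ignored.
     */
    static Map<String, String> resolve(List<Definition> inFileOrder, boolean lastWins) {
        Map<String, String> table = new LinkedHashMap<String, String>();
        for (Definition d : inFileOrder) {
            String key = d.number + " " + d.generation;
            if (lastWins) {
                table.put(key, d.body);
            } else {
                table.putIfAbsent(key, d.body);
            }
        }
        return table;
    }
}
```

If object 4 0 is defined once with /Count 1 and again later with /Count 4, resolve(defs, false) keeps the /Count 1 version and resolve(defs, true) keeps the /Count 4 version. Neither policy is right for every file, which is why following the cross-reference chain from the end of the document is the more robust approach.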

          nielsen added a comment -

          Hello Adam,
          Here is a hint which might help you:
          I suppose the problem in test.pdf was caused by how pages 2, 3 and 4 were inserted.
          Originally the test file was an export from Word (via plugin) with many pages. Then all pages except the first were deleted manually via Acrobat 9.
          As I wanted to test the deletion of all pages (except the first) with PDFBox, I inserted 3 pages into the test file manually as follows:
          Each page was inserted in Acrobat 9 via the menu
          Document\Insert Pages\From Clipboard (any content from clipboard)

          Martijn Brinkers added a comment -

          Yes, the uncompressed version works fine, i.e., it reports 4 pages.

          nielsen added a comment -

          For test.unc.pdf the no. of pages is 4. No problem with this file.
          I use Win XP Prof., Service Pack 3, Java 1.6.0_22

          Adam Nichols added a comment -

          I'm not sure why I'm getting different results. I'm using Windows Vista (32-bit), Java 1.5, PDFBox trunk. I'd love to dig into it and see what the problem is, but that's pretty difficult without being able to reproduce any problems. Could someone who's having this issue with test.pdf try test.unc.pdf and see if the problem exists there as well? If so, that file will be way easier to debug.

          nielsen added a comment -

          Hello Martijn,
          I wasn't aware that our mail exchange is tracked by the Apache issue tracker. That's why I attached your answer.

          nielsen added a comment -

          Hello Adam,

          You're right. The file has 4 pages (I forgot), but getNumberOfPages()
          still returns 1. The file was used as is (no uncompressing before processing).
          The user Martijn Brinkers confirms
          this behaviour (see image attached).

          I used your test code, slightly modified (assert and fail removed), to
          retest.
          The input file is attached again with the prefix "a".

          The code used:

          String inputpath = "C:\\Temp\\atest.pdf";
          PDDocument doc = null;
          try {
              doc = PDDocument.load(inputpath);
              int x = doc.getNumberOfPages(); // returns 1 for this file
              x = -1;
          } catch (Exception e) {
              e.printStackTrace();
          } finally {
              if (doc != null) {
                  try {
                      doc.close();
                  } catch (Exception e) {
                      // ignore failures while closing
                  }
              }
          }

          Thank you.

          Regards
          Michael

          Martijn Brinkers added a comment -

          That's weird because for me it always returns 1 page. I have tried it with the 1.3.1 release and with trunk.

          Adam Nichols added a comment -

          I downloaded the PDF and found that it had 4 pages (not 2). I checked getNumberOfPages() and it returned 4, so I'm unable to reproduce the problem. Here's the exact code I'm using:

          public void testPdfBox911() {
              String inputpath = "C:\\Temp\\PDFBOX-911\\test.pdf";
              PDDocument doc = null;
              try {
                  doc = PDDocument.load(inputpath);
                  assertEquals(4, doc.getNumberOfPages());
              } catch (Exception e) {
                  e.printStackTrace();
                  fail("Threw exception!");
              } finally {
                  if (doc != null) {
                      try {
                          doc.close();
                      } catch (Exception e) {
                          // ignore failures while closing
                      }
                  }
              }
          }

          For some insight on how PDF determines the number of pages, here's how I looked into the issue. I opened the pdf in a text editor and found that it was compressed, so I uncompressed it using a program called PDF toolkit (i.e. pdftk test.pdf output test.unc.pdf uncompress) and looked at the uncompressed version. I found the root was object 1 0, and /Pages was 4 0 which had the four pages. So everything seems to be okay here as far as I can tell.
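
The walk from the root to /Pages described above can be sketched with a small in-memory model. This is a hypothetical illustration, not PDFBox's implementation: PageTreeCounter and Node are invented names, and the sketch counts leaf /Page nodes by traversal instead of trusting a stored /Count value.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: counting pages by walking a PDF-style page tree. */
class PageTreeCounter {

    /** A node is either an intermediate "Pages" node with kids or a leaf "Page". */
    static final class Node {
        final String type;
        final List<Node> kids = new ArrayList<Node>();

        Node(String type) {
            this.type = type;
        }

        /** Appends a child and returns this node so calls can be chained. */
        Node add(Node kid) {
            kids.add(kid);
            return this;
        }
    }

    /** Returns the number of leaf "Page" nodes reachable from the given node. */
    static int countPages(Node node) {
        if ("Page".equals(node.type)) {
            return 1;
        }
        int total = 0;
        for (Node kid : node.kids) {
            total += countPages(kid);
        }
        return total;
    }
}
```

A "Pages" node with four "Page" kids, like object 4 0 in the uncompressed file, yields 4. If a reader ends up with an older revision of that node that lists a single kid, the same walk yields 1.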

          nielsen added a comment -

          Hello Adam,

          Please find the file attached. I didn't find a way to attach it
          to the issue.

          The second page was inserted via Acrobat 9 via menu
          Document\Insert Pages\From Clipboard (any content from clipboard)

          I wonder what criteria PDFBox uses
          to count or identify PDF pages.

          Regards
          Michael

          Adam Nichols added a comment -

          Please go ahead and attach the file. We'll need to take a look at that to see what's going on here.


            People

            • Assignee:
              Andreas Lehmkühler
              Reporter:
              nielsen
            • Votes:
              1
              Watchers:
              3
