[PDFBOX-796] Objects from streams overwrite objects already read with the same ID/Generation - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.3.1
Component/s: Parsing
Labels: None
Environment: 32-bit Windows Vista, Java 1.5, PDFBox head tag

Description

When trying to merge some documents (using the PDFMergerUtility class), I got a NullPointerException and the merge failed. Tracing through the code, I found that some objects were being overwritten when PDFParser called document.dereferenceObjectStreams() (line 207 of PDFParser.java).

Having multiple objects with the same object ID is a violation of the PDF specification, so how this should be handled is undefined. The "use the first object" approach allowed my file to be processed, and it is consistent with the other code in PDFBox. For another example of how PDFBox handles reading an object that already exists, see PDFParser (line 541), which checks whether the object has already been read and put in the pool. If so, it adds the new object to the list of conflicts. Later, when resolveConflicts() is called, it overwrites the pooled object only if the conflicting one is specifically referenced in the xref table. This is a reasonable way to resolve conflicts: if the object isn't in the xref table, it is likely the wrong one.
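The conflict handling described above can be sketched roughly as follows. This is a simplified illustration, not the PDFBox source: the class `XrefConflictSketch`, the nested `ParsedObject` type, the string keys, and the shape of the xref map are all assumptions made for the example. Only the idea (keep the first object, queue duplicates, and let a duplicate win only when the xref table points at its byte offset) comes from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the behaviour described above; not the PDFBox classes.
class XrefConflictSketch {

    /** An object parsed from the file body, tagged with the byte offset where it was found. */
    static final class ParsedObject {
        final String key;    // "objectNumber generation", e.g. "12 0"
        final long offset;   // byte offset of the object in the file
        final String value;  // stand-in for the parsed object content

        ParsedObject(String key, long offset, String value) {
            this.key = key;
            this.offset = offset;
            this.value = value;
        }
    }

    private final Map<String, ParsedObject> pool = new HashMap<>();
    private final List<ParsedObject> conflicts = new ArrayList<>();

    /** The first object seen for a key goes into the pool; later duplicates are queued as conflicts. */
    void add(ParsedObject obj) {
        if (pool.containsKey(obj.key)) {
            conflicts.add(obj);
        } else {
            pool.put(obj.key, obj);
        }
    }

    /** A queued duplicate replaces the pooled object only if the xref table points at its offset. */
    void resolveConflicts(Map<String, Long> xref) {
        for (ParsedObject candidate : conflicts) {
            Long expectedOffset = xref.get(candidate.key);
            if (expectedOffset != null && expectedOffset.longValue() == candidate.offset) {
                pool.put(candidate.key, candidate);
            }
        }
        conflicts.clear();
    }

    /** Returns the content currently pooled for the key, or null if there is none. */
    String get(String key) {
        ParsedObject obj = pool.get(key);
        return obj == null ? null : obj.value;
    }
}
```

In this sketch, adding two objects under the key "12 0" at offsets 100 and 500 keeps the first one unless the xref map passed to `resolveConflicts` maps "12 0" to 500.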

Since objects in an object stream are read from compressed data, they have no byte offset in the file. This means we can't add these conflicts to the conflict list and check them against the xref table to determine whether the object is legitimate. It's best to keep the object we've already read, as using the one from the stream has been confirmed to cause problems. I've done regression testing with other files that have this problem, including the file from PDFBOX-720, and have not seen any issues.
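A minimal sketch of the "keep the object already read" rule for object-stream contents might look like this. It is not the attached patch, and the class and method names (`FirstWinsPoolSketch`, `addParsed`, `addFromObjectStream`) are hypothetical; the sketch only assumes that objects are pooled by object number and generation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the "first object wins" rule; not the PDFBox API.
class FirstWinsPoolSketch {

    private final Map<String, String> pool = new HashMap<>();

    private static String key(int objectNumber, int generation) {
        return objectNumber + " " + generation;
    }

    /** Stores an object parsed directly from the file body. */
    void addParsed(int objectNumber, int generation, String value) {
        pool.put(key(objectNumber, generation), value);
    }

    /**
     * Stores an object unpacked from an object stream, unless an object with the
     * same number and generation is already pooled. Such objects have no byte
     * offset, so they cannot be checked against the xref table later.
     *
     * @return true if the object was stored, false if the existing one was kept
     */
    boolean addFromObjectStream(int objectNumber, int generation, String value) {
        String k = key(objectNumber, generation);
        if (pool.containsKey(k)) {
            return false; // keep the object that was read first
        }
        pool.put(k, value);
        return true;
    }

    /** Returns the pooled content, or null if no such object exists. */
    String get(int objectNumber, int generation) {
        return pool.get(key(objectNumber, generation));
    }
}
```

Returning a boolean from `addFromObjectStream` is a design choice of this sketch; it lets a caller log or count the skipped duplicates without changing what ends up in the pool.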

Unfortunately, I cannot provide the PDF that demonstrates this problem and solution, as it contains information I'm not authorized to release.

Attachments

PDFBOX-796.patch (0.8 kB), attached 20/Aug/10 23:59 by Adam Nichols

Issue Links

relates to

PDFBOX-911 Method PDDocument.getNumberOfPages() returns wrong number of pages

Closed

People

Assignee: Adam Nichols

Reporter: Adam Nichols

Dates

Due: 27/Aug/10

Created: 20/Aug/10 23:52

Updated: 03/Dec/10 04:46

Resolved: 26/Aug/10 17:45