[PDFBOX-2402] NonSequentialPDFParser cannot recover from spurious closing brackets - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.8.8, 2.0.0
Fix Version/s: 1.8.8, 2.0.0
Component/s: Parsing
Labels: None

Description

The NonSequentialPDFParser fails if an object has a spurious closing delimiter (for example, a PDF array with two closing brackets). In lenient mode, the parser should at least attempt to recover from this. Instead of throwing an exception when the endObject string is neither "endobj" nor " obj", the attached patch skips one character (the spurious one) and tries to read the string again. It repeats this until it either finds "endobj" or reaches the end of the file.

Unfortunately, I have a document where this worked, but I am not allowed to upload it. In any case, the patch cannot make things worse, since it only replaces a thrown exception with an attempt to recover.
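
The recovery strategy described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch or actual PDFBox code: the class `EndObjRecovery`, the method `findEndObj`, and the choice to work on a byte array rather than the parser's input stream are all assumptions made for the example. For brevity, it also leaves out the " obj" case mentioned in the description.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class EndObjRecovery {
    private static final String END_OBJ = "endobj";

    /**
     * Looks for "endobj" starting at pos, i.e. where the parser expects it
     * after reading an object's body.
     * Strict mode: throws if the next token does not start with "endobj".
     * Lenient mode: skips one byte at a time and retries until "endobj" is
     * found or the data ends.
     * Returns the offset just past "endobj", or -1 if the data ends first.
     */
    static int findEndObj(byte[] data, int pos, boolean lenient) throws IOException {
        int cur = pos;
        while (cur < data.length) {
            cur = skipWhitespace(data, cur);
            if (cur >= data.length) {
                break; // only whitespace was left
            }
            String token = readToken(data, cur);
            if (token.startsWith(END_OBJ)) {
                return cur + END_OBJ.length();
            }
            if (!lenient) {
                throw new IOException("Object at offset " + cur
                        + " does not end with 'endobj' but with '" + token + "'");
            }
            cur++; // assume one spurious character, skip it and try again
        }
        return -1; // reached end of data without finding "endobj"
    }

    /** PDF whitespace: NUL, HT, LF, FF, CR, SP. */
    private static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    private static int skipWhitespace(byte[] data, int pos) {
        while (pos < data.length && isWhitespace(data[pos])) {
            pos++;
        }
        return pos;
    }

    /** Reads bytes from start up to the next whitespace or the end of data. */
    private static String readToken(byte[] data, int start) {
        int end = start;
        while (end < data.length && !isWhitespace(data[end])) {
            end++;
        }
        return new String(data, start, end - start, StandardCharsets.ISO_8859_1);
    }
}
```

For an object body such as `[1 2 3]]` followed by `endobj`, calling `findEndObj` at the offset of the second `]` with `lenient` set to `true` skips the extra bracket and returns the offset just past `endobj`. With `lenient` set to `false`, the same call throws an `IOException`, which corresponds to the behaviour reported in this issue.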

Attachments

- screenshot.png, 09/Oct/14 17:46, 84 kB, Michele Balistreri
- NonSequentialPDFParser.patch, 04/Oct/14 17:39, 2 kB, Michele Balistreri

Issue Links

relates to: PDFBOX-1811 java.io.IOException: Object at offset does not end with 'endobj' (Closed)

People

Assignee: Tilman Hausherr
Reporter: Michele Balistreri
Votes: 0
Watchers: 3

Dates

Created: 04/Oct/14 17:37
Updated: 13/Dec/14 14:15
Resolved: 11/Oct/14 13:25