[PDFBOX-1299] BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: None
Labels: None

Description

The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
copying bytes to the output, stopping once it sees "endstream".

The problem with this approach is that the stream data itself sometimes
contains "endstream", which causes readUntilEndOfStream to stop too early.
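
To make the failure concrete, here is a minimal sketch of a keyword scan over a stream body that itself contains "endstream". It is not PDFBox code: the class EndstreamScanDemo and its methods are hypothetical, and the byte layout is a made-up stand-in for an embedded PDF.

```java
import java.nio.charset.StandardCharsets;

public final class EndstreamScanDemo {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    // Offset of the first occurrence of needle in data at or after 'from', or -1 if absent.
    static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    // Naive strategy: treat the first "endstream" keyword as the end of the stream body.
    static int scannedLength(byte[] data, int bodyStart) {
        int end = indexOf(data, ENDSTREAM, bodyStart);
        return end < 0 ? -1 : end - bodyStart;
    }

    public static void main(String[] args) {
        // Hypothetical outer stream body: a tiny embedded PDF fragment with its own stream.
        String embedded = "%PDF-1.4\n1 0 obj\n<< /Length 3 >>\nstream\nabc\nendstream\nendobj\n";
        byte[] outer = (embedded + "\nendstream\nendobj\n").getBytes(StandardCharsets.US_ASCII);

        int scanned = scannedLength(outer, 0);
        // The scan stops at the inner keyword, so it reports fewer bytes than the real body.
        System.out.println("real body length = " + embedded.length() + ", scanned = " + scanned);
    }
}
```

The scan has no way to tell the inner keyword from the outer one, so everything after the first match is cut off.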

This can legitimately happen when the stream is an embedded PDF; I'll
attach a test PDF showing this.

However, the stream dict declares the stream length (in bytes)... so
it seems like we should respect that length (if present) and
simply copy over that many bytes, instead of scanning the stream bytes
for endstream? This should be a lot faster too...
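
As a rough illustration of the length-based idea, the sketch below copies exactly the declared number of bytes and never inspects them. It is not the attached patch and not PDFBox's API: DeclaredLengthCopy and copyExactly are hypothetical names, and the declared length is assumed to have been read from the stream dict already.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public final class DeclaredLengthCopy {
    // Copies exactly 'length' bytes from in to out without looking at their content.
    static void copyExactly(InputStream in, OutputStream out, long length) throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = length;
        while (remaining > 0) {
            int want = (int) Math.min(buffer.length, remaining);
            int got = in.read(buffer, 0, want);
            if (got < 0) {
                throw new EOFException("input ended with " + remaining + " bytes still expected");
            }
            out.write(buffer, 0, got);
            remaining -= got;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical body that happens to contain the keyword "endstream".
        byte[] body = "abc\nendstream\nxyz".getBytes(StandardCharsets.US_ASCII);
        byte[] file = "abc\nendstream\nxyz\nendstream\n".getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyExactly(new ByteArrayInputStream(file), out, body.length);
        System.out.println("copied " + out.size() + " bytes");
    }
}
```

Because the bytes are never compared against a keyword, an embedded "endstream" cannot end the copy early, and the per-byte matching cost disappears.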

I imagine we always scan so that we are more robust if the length is
missing/invalid? Is that why this method was used? (I don't know the
history here...). If so, maybe we can have an option to use
the declared stream length if present.
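
One way such an option could stay robust is to trust the declared length only when "endstream" actually follows the body at that offset, and to fall back to scanning otherwise. The sketch below is an assumption about how that might look, not the behaviour of PDFBox or of the attached patch; StreamLengthResolver and its methods are hypothetical, and -1 is used here to mean "no length declared".

```java
import java.nio.charset.StandardCharsets;

public final class StreamLengthResolver {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    // True if data contains prefix starting exactly at offset 'at'.
    static boolean startsWith(byte[] data, byte[] prefix, int at) {
        if (at < 0 || at > data.length - prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[at + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Offset of the first occurrence of needle at or after 'from', or -1 if absent.
    static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i <= data.length - needle.length; i++) {
            if (startsWith(data, needle, i)) {
                return i;
            }
        }
        return -1;
    }

    // True if "endstream" follows 'pos' after any end-of-line bytes.
    static boolean endstreamFollows(byte[] data, int pos) {
        int p = pos;
        while (p < data.length && (data[p] == '\r' || data[p] == '\n')) {
            p++;
        }
        return startsWith(data, ENDSTREAM, p);
    }

    // Returns the body length to use; declaredLength < 0 means the dict had no usable length.
    static int resolveLength(byte[] data, int bodyStart, int declaredLength) {
        if (declaredLength >= 0
                && declaredLength <= data.length - bodyStart
                && endstreamFollows(data, bodyStart + declaredLength)) {
            return declaredLength; // declared length checks out, use it
        }
        // Fallback: scan for the keyword (the result includes any EOL bytes before it).
        int end = indexOf(data, ENDSTREAM, bodyStart);
        return end < 0 ? -1 : end - bodyStart;
    }
}
```

For example, with a body of "abc\nendstream\nxyz" followed by "\nendstream", calling resolveLength with the correct declared length returns the full body, while a missing or wrong length falls back to the keyword scan and still stops at the first match.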

I have a patch that uses the declared stream length (if present); with it,
at least this test PDF parses correctly.

Attachments

- PDFBOX-1299.patch (2 kB), Michael McCandless, 29/Apr/12 13:46
- Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf (6.89 MB), Michael McCandless, 29/Apr/12 14:38

People

Assignee: Timo Boehme

Reporter: Michael McCandless

Votes: 0

Watchers: 2

Dates

Created: 29/Apr/12 13:44

Updated: 21/May/12 15:51

Resolved: 21/May/12 14:07