[PDFBOX-1000] Conforming parser - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.7.0
Component/s: Parsing
Labels: None

Description

A conforming parser starts at the end of the file and reads backward until it has found the EOF marker, the xref location, and the trailer [1]. It then reads the xref table so it can locate other objects and revisions. This also lets the parser skip objects that the xref table marks as obsolete [2]. Only the minimum amount of information is read when the file is loaded; further information is loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008.
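
As a rough illustration of that first step (this is not the attached patch), the sketch below scans the last bytes of a file for the final `startxref` keyword and returns the xref offset that follows it. `TailReader`, `findStartXref`, and the 1024-byte window are names and values assumed for the example only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of PDFBox or the attached patch. */
final class TailReader {
    // Assumed window size: the keyword, offset and %%EOF marker are expected
    // to sit within the last kilobyte of the file.
    private static final int TAIL_SIZE = 1024;

    /** Parses the xref offset written between the last "startxref" and the last "%%EOF". */
    static long findStartXref(byte[] tail) {
        // ISO-8859-1 maps every byte to one char, so indexes match byte positions.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            throw new IllegalArgumentException("missing %%EOF marker");
        }
        // Search backward from the EOF marker so the most recent revision wins.
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            throw new IllegalArgumentException("missing startxref keyword");
        }
        String number = text.substring(keyword + "startxref".length(), eof).trim();
        return Long.parseLong(number);
    }

    /** Reads only the tail of the file, then delegates to the byte-array variant. */
    static long findStartXref(RandomAccessFile file) throws IOException {
        long length = file.length();
        int size = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[size];
        file.seek(length - size);
        file.readFully(tail);
        return findStartXref(tail);
    }
}
```

With the offset in hand, a parser can seek straight to the xref section without touching the rest of the file. Searching backward from the last `%%EOF` means that in a file with incremental updates the newest xref section is found first.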

Existing code will be re-used where possible, but new classes are required to accommodate lazy reading, which is a very different paradigm from the existing parser. Using separate classes also keeps regression bugs from making their way into the PDDocument or BaseParser classes, and changes to existing classes will be kept to a minimum for the same reason.
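
To make the lazy-reading idea concrete, here is a minimal sketch under stated assumptions. `LazyObjectTable` and its loader callback are hypothetical names invented for this example and do not describe the classes in the attached patch. The table records only byte offsets up front and parses an object the first time it is requested.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical illustration of lazy object loading; not the patch's implementation. */
final class LazyObjectTable<T> {
    private final Map<Integer, Long> offsets = new HashMap<>(); // from the xref table
    private final Map<Integer, T> loaded = new HashMap<>();     // objects parsed so far
    private final LongFunction<T> loader;                       // parses one object at an offset

    LazyObjectTable(LongFunction<T> loader) {
        this.loader = loader;
    }

    /** Registers where an object lives without reading it. */
    void addEntry(int objectNumber, long byteOffset) {
        offsets.put(objectNumber, byteOffset);
    }

    /** Parses the object on first access and returns the cached copy afterwards. */
    T get(int objectNumber) {
        T cached = loaded.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            throw new IllegalArgumentException("no xref entry for object " + objectNumber);
        }
        T value = loader.apply(offset);
        loaded.put(objectNumber, value);
        return value;
    }

    /** Number of objects that have actually been parsed. */
    int loadedCount() {
        return loaded.size();
    }
}
```

In this sketch, opening a document costs one `addEntry` call per xref entry, and objects that are never requested are never parsed.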

[1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
[2] Section 7.5.4 "the entire file need not be read to locate any particular object"

Attachments

- conforming-parser.patch (6 kB, 28/Apr/11 04:29, Adam Nichols)
- ConformingPDDocument.java (3 kB, 25/Apr/11 06:29, Adam Nichols)
- ConformingPDFParser.java (27 kB, 28/Apr/11 04:35, Adam Nichols)
- ConformingPDFParserTest.java (2 kB, 25/Apr/11 06:29, Adam Nichols)
- COSUnread.java (2 kB, 28/Apr/11 04:35, Adam Nichols)
- gdb-refcard.pdf (75 kB, 25/Apr/11 06:29, Adam Nichols)
- PDFLexer.java (46 kB, 19/Jul/12 07:22, Maruan Sahyoun)
- PDFLexer.java (45 kB, 08/Apr/12 17:14, Maruan Sahyoun)
- PDFStreamConstants.java (4 kB, 19/Jul/12 07:22, Maruan Sahyoun)
- PDFStreamConstants.java (5 kB, 08/Apr/12 17:14, Maruan Sahyoun)
- XrefEntry.java (1 kB, 26/Apr/11 02:45, Adam Nichols)

Issue Links

is depended upon by

PDFBOX-911 Method PDDocument.getNumberOfPages() returns wrong number of pages (Closed)

People

Assignee: Adam Nichols

Reporter: Adam Nichols

Votes: 4

Watchers: 8

Dates

Created: 21/Apr/11 06:31

Updated: 11/Oct/14 04:08

Resolved: 11/Oct/14 04:08