[PDFBOX-4512] Optimizes PDF parser for signature operations - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.14
Fix Version/s: None
Component/s: Signing
Labels: None

Description

This suggestion was first raised in a comment on PDFBOX-4511; I am describing it here in more detail. We need it because in some situations a single user signs or verifies many PDFs, and each PDF is fully parsed in memory, which consumes a lot of memory.
I have been using this approach for about two years, and so far only one PDF has parsed incorrectly.

These are the modified classes:

ParserType: A new enum that indicates the kind of parsing to perform on the document. It has the following values:

Complete: All PDF objects are read and parsed. This is the current behavior and the default if no other value is provided.
Signing: Used for digital signing. The following entries are read:
- Root is read to access the Pages and AcroForm entries.
- AcroForm is read to access the Fields entry (signatures).
- Pages is read to access the page count and the references to the page objects.
- Each Page entry is read to access the page rotation and dimensions, as well as its annotations (signatures), so that the new signature can be added.
- FT is read to access the V entry (the signature itself).
- The Sig entry is read to access the Reference entry (constraint).

Extended: Used for adding DSS and VRI data.
- Root is read to access the AcroForm and DSS entries.
- AcroForm is read to access the Fields entry (signatures).
- DSS is read entirely, to know which objects have already been included for each signature.
- FT is read to access the V entry (the signature itself).
- The Sig entry is read to access Contents (required to calculate the SHA-1 hash that identifies the signature in the DSS).
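
The signature identifier mentioned in the last item can be sketched as follows. This is a minimal illustration, not code from the attached patches: the class and method names are hypothetical, and writing the digest as uppercase hexadecimal is an assumption about how the identifier is used as a key in the DSS. Which bytes of Contents are hashed is left to the caller.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a signature identifier from the raw bytes
// of the signature's Contents entry.
final class SignatureId {
    private SignatureId() {
    }

    // Returns the SHA-1 digest of the given bytes as uppercase hex
    // (the uppercase form is an assumption, see the text above).
    static String fromContents(byte[] contents) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(contents);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real Contents value.
        byte[] placeholder = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(fromContents(placeholder));
    }
}
```

Because only Contents is needed for this computation, the Extended mode can skip the other entries of the signature dictionary.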

Verifying: Used for digital signature verification.
- Root is read to access the Pages, AcroForm and DSS entries.
- AcroForm is read to access the Fields entry (signatures).
- DSS is read entirely.
- Pages is read to access the page count.
- FT is read to access the V entry (the signature itself).
- The Sig entry is read to access the Reference (constraint), Contents (required to calculate the SHA-1 hash that identifies the signature in the DSS), field name, filter, subfilter and ByteRange (to calculate the signed content).
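
To make the differences between the modes concrete, here is a rough sketch of how such an enum could record which signature dictionary entries each mode reads. It is not taken from the attached patches: the upper-case constant names, the readsSigEntry method and the use of plain strings instead of COSName are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed enum. Each mode lists the signature
// dictionary entries it reads, following the description above.
enum ParserType {
    COMPLETE(), // reads everything, so no explicit list is needed
    SIGNING("Reference"),
    EXTENDED("Contents"),
    // The field name is omitted here because it is not a key of the
    // signature dictionary itself.
    VERIFYING("Reference", "Contents", "Filter", "SubFilter", "ByteRange");

    private final Set<String> sigEntries;

    ParserType(String... sigEntries) {
        this.sigEntries = Collections.unmodifiableSet(
                new HashSet<>(Arrays.asList(sigEntries)));
    }

    // True if this mode would read the given signature dictionary entry.
    boolean readsSigEntry(String name) {
        return this == COMPLETE || sigEntries.contains(name);
    }
}

final class ParserTypeDemo {
    public static void main(String[] args) {
        for (ParserType type : ParserType.values()) {
            System.out.println(type + " reads ByteRange: "
                    + type.readsSigEntry("ByteRange"));
        }
    }
}
```

Keeping the per-mode lists in one place would make it easy to see, and to review, what each mode skips.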

PDDocument: A new load method accepts the ParserType enum. Existing methods keep their current behavior. The ParserType value is passed to the new parse(ParserType) method of the PDFParser class.

PDFParser: The parse method accepts the ParserType value and passes it on to the parseDictObjects(COSDictionary dict, ParserType parserType, COSName... excludeObjects) method.

COSParser: A new method, parseDictObjects(COSDictionary dict, ParserType parserType, COSName... excludeObjects), handles entries that are not needed, or only need to be partially read, depending on the value of ParserType.
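
The call chain described above (load, then parse, then parseDictObjects) can be illustrated with a small stand-alone sketch. It does not use PDFBox and is not the code from the patches: a Map stands in for COSDictionary, strings stand in for COSName, the class name is hypothetical, and only the Root-level filtering is modelled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the load -> parse -> parseDictObjects chain.
final class SelectiveParserSketch {
    enum ParserType { COMPLETE, SIGNING, EXTENDED, VERIFYING }

    // Existing entry point: behaves as before by defaulting to COMPLETE.
    static Map<String, Object> load(Map<String, Object> root) {
        return load(root, ParserType.COMPLETE);
    }

    // New entry point: forwards the requested mode to the dictionary parser.
    static Map<String, Object> load(Map<String, Object> root, ParserType type) {
        return parseDictObjects(root, type);
    }

    // Keeps only the entries the mode needs, minus explicitly excluded ones.
    static Map<String, Object> parseDictObjects(Map<String, Object> dict,
            ParserType type, String... excludeObjects) {
        Set<String> excluded = new HashSet<>(Arrays.asList(excludeObjects));
        Map<String, Object> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : dict.entrySet()) {
            String key = entry.getKey();
            if (!excluded.contains(key) && wanted(type, key)) {
                parsed.put(key, entry.getValue());
            }
        }
        return parsed;
    }

    // Root entries read per mode, as listed in the description.
    static boolean wanted(ParserType type, String key) {
        switch (type) {
            case SIGNING:
                return key.equals("Pages") || key.equals("AcroForm");
            case EXTENDED:
                return key.equals("AcroForm") || key.equals("DSS");
            case VERIFYING:
                return key.equals("Pages") || key.equals("AcroForm")
                        || key.equals("DSS");
            default:
                return true; // COMPLETE reads everything
        }
    }

    public static void main(String[] args) {
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("Pages", "pages tree");
        root.put("AcroForm", "form");
        root.put("DSS", "validation data");
        root.put("Metadata", "xmp");
        System.out.println(load(root, ParserType.SIGNING).keySet());
    }
}
```

Routing the old load through the new one with COMPLETE is one way to guarantee that existing callers see no change in behavior.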

COSName: Adds two new names, "DSS" and "Reference".

This is just an alternative way of parsing a PDF for signature operations.

Thanks

Attachments

- cosname.patch, 1 kB, 11/Apr/19 20:46, Luiz
- cosparser.patch, 14 kB, 11/Apr/19 20:46, Luiz
- pddocument.patch, 15 kB, 11/Apr/19 20:45, Luiz
- pdfparser.patch, 4 kB, 11/Apr/19 20:45, Luiz

Issue Links

links to

Stack overflow - Faster PDF page dimensions using PDFBox?

Stack overflow - How to fetch MediaBox of PDF pages without parsing whole file?

People

Assignee: Unassigned

Reporter: Luiz

Votes: 1

Watchers: 4

Dates

Created: 11/Apr/19 20:50

Updated: 02/May/19 09:51