  1. PDFBox
  2. PDFBOX-4512

Optimize PDF parser for signature operations

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.14
    • Fix Version/s: None
    • Component/s: Signing
    • Labels: None

    Description

      This suggestion was first raised in a comment on PDFBOX-4511; I am describing it here in more detail. We need it because in some situations a single user signs or verifies a large number of PDFs, and each PDF is fully parsed in memory, which consumes a lot of memory.
      I have been using this approach for about two years, and so far only one PDF has been parsed oddly.

      The following classes are modified:

      ParserType: A new enum that indicates how much of the document the parser reads. It has the following values:

      • Complete: All PDF objects are read and parsed. This is the current behavior and the default when no other value is provided.
      • Signing: Used for digital signing. The following entries are read:
        • Root is read to access the Pages and AcroForm entries.
        • AcroForm is read to access the Fields entry (signatures).
        • Pages is read to access the page count and the references to the page objects.
        • Each Page entry is read to access the page rotation and dimensions, as well as its annotations (signatures), so that the new signature can be added.
        • FT is read to access the V entry (the signature itself).
        • The Sig entry is read to access the Reference entry (constraint).
      • Extended: Used for adding the DSS and VRI. The following entries are read:
        • Root is read to access the AcroForm and DSS entries.
        • AcroForm is read to access the Fields entry (signatures).
        • DSS is read entirely, to know which objects have already been included for the signature.
        • FT is read to access the V entry (the signature itself).
        • The Sig entry is read to access Contents (required to calculate the SHA-1 hash that identifies the signature in the DSS).
      • Verifying: Used for digital signature verification. The following entries are read:
        • Root is read to access the Pages, AcroForm and DSS entries.
        • AcroForm is read to access the Fields entry (signatures).
        • DSS is read entirely.
        • Pages is read to access the page count.
        • FT is read to access the V entry (the signature itself).
        • The Sig entry is read to access the Reference entry (constraint), Contents (required to calculate the SHA-1 hash that identifies the signature in the DSS), the field name, Filter, SubFilter and ByteRange (to calculate the signed content).
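
      The two byte-level computations mentioned in the list above (the SHA-1 identifier derived from Contents and the signed content selected by ByteRange) can be sketched roughly as follows. This is a minimal illustration using only the JDK, not code from the attached patches: the class and method names are hypothetical, and it assumes that the DSS identifier is the uppercase hexadecimal SHA-1 digest of the Contents bytes and that ByteRange is a flat array of (offset, length) pairs.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; not part of PDFBox or of the attached patches.
final class SignatureBytesSketch {

    // Assumption: the signature identifier in the DSS is the uppercase hex
    // SHA-1 digest of the bytes stored in the signature's Contents entry.
    static String vriKey(byte[] contents) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(contents);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard JDK algorithm, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Assumption: byteRange holds (offset, length) pairs, typically
    // [0, len1, offset2, len2], covering everything except Contents.
    static byte[] signedContent(byte[] pdf, int[] byteRange) {
        if (byteRange.length % 2 != 0) {
            throw new IllegalArgumentException("ByteRange must hold offset/length pairs");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < byteRange.length; i += 2) {
            // Copy each signed segment in order, skipping the gap between them.
            out.write(pdf, byteRange[i], byteRange[i + 1]);
        }
        return out.toByteArray();
    }
}
```

      Because both helpers only need the raw bytes plus the Contents and ByteRange values, a reduced parse that stops at the Sig entry would be enough to feed them.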

      PDDocument: A new load method accepts the ParserType enum. The existing load methods keep their current behavior. The ParserType value is passed to the new parse(ParserType) method of the PDFParser class.

      PDFParser: The parse method accepts the ParserType value and passes it on to the parseDictObjects(COSDictionary dict, ParserType parserType, COSName... excludeObjects) method.

      COSParser: A new method parseDictObjects(COSDictionary dict, ParserType parserType, COSName... excludeObjects) handles entries that are not needed, or that only need to be read partially, depending on the value of ParserType.
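
      As a rough sketch of the idea (not the code in cosparser.patch), the filtering could look like the following. It is self-contained and uses stand-ins: a Map<String, Object> in place of COSDictionary, String in place of COSName, and a hypothetical SelectiveParseSketch class. Only the Root-level entries named in the ParserType list are modeled; the real method would also have to resolve the referenced objects.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the load -> parse -> parseDictObjects chain.
final class SelectiveParseSketch {

    enum ParserType { COMPLETE, SIGNING, EXTENDED, VERIFYING }

    // Root entries that each reduced mode follows, taken from the description.
    // COMPLETE has no entry here, which means "read everything".
    private static final Map<ParserType, Set<String>> ROOT_ENTRIES =
            new EnumMap<>(ParserType.class);
    static {
        ROOT_ENTRIES.put(ParserType.SIGNING,
                new HashSet<>(Arrays.asList("Pages", "AcroForm")));
        ROOT_ENTRIES.put(ParserType.EXTENDED,
                new HashSet<>(Arrays.asList("AcroForm", "DSS")));
        ROOT_ENTRIES.put(ParserType.VERIFYING,
                new HashSet<>(Arrays.asList("Pages", "AcroForm", "DSS")));
    }

    // Returns the entries that would actually be parsed for the given mode,
    // skipping anything explicitly excluded by the caller.
    static Map<String, Object> parseDictObjects(Map<String, Object> dict,
                                                ParserType parserType,
                                                String... excludeObjects) {
        Set<String> excluded = new HashSet<>(Arrays.asList(excludeObjects));
        Set<String> wanted = ROOT_ENTRIES.get(parserType); // null for COMPLETE
        Map<String, Object> parsed = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : dict.entrySet()) {
            String name = entry.getKey();
            if (excluded.contains(name)) {
                continue;
            }
            if (wanted != null && !wanted.contains(name)) {
                continue;
            }
            parsed.put(name, entry.getValue());
        }
        return parsed;
    }

    // Existing-style entry point: behavior is unchanged (full parse).
    static Map<String, Object> load(Map<String, Object> root) {
        return load(root, ParserType.COMPLETE);
    }

    // New-style entry point: the caller chooses how much is read.
    static Map<String, Object> load(Map<String, Object> root, ParserType parserType) {
        return parseDictObjects(root, parserType);
    }
}
```

      Delegating the old entry point to the new one with COMPLETE is one simple way to keep existing callers unaffected while letting signature code opt in to a reduced parse.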

      COSName: Two new names are added, "DSS" and "Reference".

       

      This is just an alternative way of parsing a PDF in order to perform signature operations.

      Thanks

      Attachments

        1. pddocument.patch
          15 kB
          Luiz
        2. pdfparser.patch
          4 kB
          Luiz
        3. cosparser.patch
          14 kB
          Luiz
        4. cosname.patch
          1 kB
          Luiz

          People

            Assignee: Unassigned
            Reporter: Urias Luiz
            Votes: 1
            Watchers: 4
