PDFBox / PDFBOX-4952

PDF compression - object stream creation

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.21
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: PDModel
    • Labels: None

    Description

      I implemented a basic starting point for PDF compression, based on PDFBox 2.0.22-SNAPSHOT.

      I am using this ticket to ask whether you would be interested in such a feature and in merging it into PDFBox.

      This is a proof of concept (POC): it implements only very basic functionality, which can and should be extended further, and it comes with only a few simplistic unit tests.
      However, it already reduces the size of the resulting documents and creates object streams as defined in the PDF reference manual.

      What it currently does:
      It bundles objects into compressed object streams and additionally applies simple content compression to a small selection of content types.

      To realize content compression, it provides a simple interface and an abstract class for "ContentCompressor"s, which search a document for specific content that can be compressed and then compress it.

      Currently two content compressors exist:
      ImageCompressor
      Searches for simple images that can be compressed using DCT.

      UnencodedStreamCompressor
      Searches the document for streams that are not yet encoded and applies Flate compression where necessary.
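
      To illustrate the compressor idea, here is a minimal sketch in plain Java. It is not the code from the pull request: the names "Compressor", "UnencodedStream" and "accepts" are hypothetical, and a stream is represented only by its raw bytes and an optional filter name instead of PDFBox's COS classes. It shows one way "apply Flate only to unencoded streams" could look, using java.util.zip.Deflater, which produces the zlib-wrapped data that FlateDecode expects.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateCompressorSketch {

    /** Hypothetical stand-in for the "ContentCompressor" idea; NOT the interface from the pull request. */
    interface Compressor {
        /** filterName is the stream's current filter, or null if the stream is unencoded. */
        boolean accepts(String filterName);

        byte[] compress(byte[] raw);
    }

    /** Assumed behaviour: only streams without any filter are Flate-compressed. */
    static final class UnencodedStream implements Compressor {
        private final int level;

        UnencodedStream(int level) {
            this.level = level;
        }

        @Override
        public boolean accepts(String filterName) {
            return filterName == null;
        }

        @Override
        public byte[] compress(byte[] raw) {
            Deflater deflater = new Deflater(level); // zlib wrapper, as used by FlateDecode
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            deflater.end();
            return out.toByteArray();
        }
    }

    /** Reverses compress(); only used here to check the round trip. */
    static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            content.append("BT /F1 12 Tf (Hello) Tj ET\n");
        }
        byte[] raw = content.toString().getBytes(StandardCharsets.US_ASCII);
        Compressor compressor = new UnencodedStream(Deflater.BEST_COMPRESSION);
        if (compressor.accepts(null)) {
            byte[] packed = compressor.compress(raw);
            System.out.println(raw.length + " -> " + packed.length + " bytes, round trip ok: "
                    + Arrays.equals(raw, inflate(packed)));
        }
    }
}
```

      A real implementation would also have to update the stream dictionary (/Filter, /Length) and could keep the original bytes when compression does not make the stream smaller.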

      Both compressors can be parameterized using a centralized "CompressParameters" instance, which is passed to a new "saveCompressed" method of PDDocument.
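
      As a hedged illustration of how a central parameter object could drive DCT compression, the sketch below re-encodes an image as JPEG with javax.imageio. The "Params" class and its "jpegQuality" field are assumptions made for this example; the ticket does not say which settings "CompressParameters" actually exposes, and the pull request may handle images differently.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

class DctSketch {

    /** Hypothetical parameter holder; the real "CompressParameters" may look quite different. */
    static final class Params {
        final float jpegQuality; // 0.0 = smallest file, 1.0 = best quality

        Params(float jpegQuality) {
            if (jpegQuality < 0f || jpegQuality > 1f) {
                throw new IllegalArgumentException("quality must be within 0..1");
            }
            this.jpegQuality = jpegQuality;
        }
    }

    /** Encodes an RGB image as baseline JPEG (DCT) with the configured quality. */
    static byte[] encodeJpeg(BufferedImage image, Params params) {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam writeParam = writer.getDefaultWriteParam();
        writeParam.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        writeParam.setCompressionQuality(params.jpegQuality);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ImageOutputStream target = ImageIO.createImageOutputStream(out)) {
            writer.setOutput(target);
            writer.write(null, new IIOImage(image, null, null), writeParam);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return out.toByteArray();
    }

    /** Deterministic, detail-rich test pattern so the quality setting has a visible effect. */
    static BufferedImage testImage(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = (x * 37 + y * 91) & 0xFF;
                int g = ((x ^ y) * 5) & 0xFF;
                int b = (x * y) & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    public static void main(String[] args) {
        BufferedImage image = testImage(64, 64);
        byte[] small = encodeJpeg(image, new Params(0.1f));
        byte[] large = encodeJpeg(image, new Params(0.9f));
        System.out.println("quality 0.1: " + small.length + " bytes, quality 0.9: " + large.length + " bytes");
    }
}
```

      In a PDF, the resulting bytes would become the content of an image stream with the DCTDecode filter; which images count as "simple" enough for this is left open here.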

      The compression is implemented as a set of extensions to the "COSWriter" class. Basically, it organizes the objects passed to the COSWriter into object streams and applies content optimization where necessary and possible.
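
      For readers unfamiliar with object streams, the following sketch shows the decoded layout that the PDF specification defines for them: N pairs of "object number, byte offset" followed by the objects themselves, with /First giving the offset of the first object. It is a simplified illustration, not the COSWriter extension from the pull request; the class and method names are hypothetical, and objects are assumed to be already serialized, single-byte-encoded strings.

```java
import java.nio.charset.StandardCharsets;

class ObjStmSketch {

    final int count;   // value for /N
    final int first;   // value for /First
    final byte[] data; // decoded stream content (before Flate compression)

    private ObjStmSketch(int count, int first, byte[] data) {
        this.count = count;
        this.first = first;
        this.data = data;
    }

    /**
     * Bundles already serialized objects (no "obj"/"endobj" keywords, generation 0,
     * no stream objects) into the decoded body of an object stream.
     */
    static ObjStmSketch build(int[] objectNumbers, String[] serialized) {
        if (objectNumbers.length != serialized.length) {
            throw new IllegalArgumentException("one serialized body per object number is required");
        }
        StringBuilder header = new StringBuilder();
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < objectNumbers.length; i++) {
            // The offset is relative to the first object, i.e. to /First.
            header.append(objectNumbers[i]).append(' ').append(body.length()).append(' ');
            body.append(serialized[i]).append('\n');
        }
        // ISO-8859-1 maps one char to one byte, so char offsets equal byte offsets here.
        byte[] data = (header.toString() + body).getBytes(StandardCharsets.ISO_8859_1);
        return new ObjStmSketch(objectNumbers.length, header.length(), data);
    }

    /** Stream dictionary for the uncompressed form; /Filter is omitted in this sketch. */
    String dictionary() {
        return "<< /Type /ObjStm /N " + count + " /First " + first + " /Length " + data.length + " >>";
    }

    public static void main(String[] args) {
        ObjStmSketch stream = build(
                new int[] {11, 12},
                new String[] {"<< /Type /Font >>", "[1 2 3]"});
        System.out.println(stream.dictionary());
        System.out.print(new String(stream.data, StandardCharsets.ISO_8859_1));
    }
}
```

      A writer would normally Flate-compress this body, add /Filter /FlateDecode, and reference the contained objects from a cross-reference stream rather than a classic xref table.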

      Currently this supports encryption, but it does not support linearization of the compressed documents.

      Caveat:
      If this feature is interesting to you, I do not expect you to simply merge this fork into 2.0.22. I expect that you will want some details and concepts changed, and I am ready to implement whatever changes are required for this to work to your liking.

      POC:
      Four resulting documents can be found in "target/test-output/compression" after "COSDocumentCompressionTest" has been run.

      The pull request can be found on GitHub at:
      https://github.com/apache/pdfbox/pull/86

      Attachments

        1. problematic.pdf
          306 kB
          Christian Appl
        2. image-2021-08-17-10-56-48-431.png
          88 kB
          Christian Appl
        3. image-2021-08-17-10-24-44-999.png
          73 kB
          Christian Appl
        4. image-2021-08-17-10-21-00-352.png
          83 kB
          Christian Appl
        5. image-2021-08-17-10-10-21-418.png
          16 kB
          Christian Appl
        6. image-2021-08-17-10-07-33-682.png
          69 kB
          Christian Appl
        7. image-2020-09-07-10-05-15-631.png
          4 kB
          Christian Appl
        8. image-2020-09-07-09-47-30-172.png
          6 kB
          Christian Appl
        9. 102_Spot_to_CMYK_X1a.pdf
          1.30 MB
          Tilman Hausherr
        10. 102_Spot_to_CMYK_X1a_unc_GOOD-2.0.22.pdf
          1.52 MB
          Tilman Hausherr
        11. 102_Spot_to_CMYK_X1a_unc_BAD-3.0.0.pdf
          1.52 MB
          Tilman Hausherr

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Christian Appl (capSVD)
              Votes: 0
              Watchers: 6