  1. PDFBox
  2. PDFBOX-5279

PDF compression - Content compression



    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0 PDFBox
    • Fix Version/s: None
    • Component/s: Rendering, Writing
    • Labels: None


      This is a follow-up ticket to PDFBOX-4952. It discusses whether content compression would be a good idea and which contents could be compressed to further reduce the file size of the resulting PDF documents.

      I would like to provide such a feature, and I may be assigned to work on something like it, but I do not want to repeat my mistake of developing a solution first and contacting you only afterwards. Earlier feedback led to much simpler and better solutions, hence this ticket.

      Basic question:
      Should PDFBox be able to do something like that?
      All further thoughts are somewhat pointless if it should not.

      Limitations of object streaming and PDFBOX-4952:
      Object streaming can reduce the file size of documents that contain many referenceable top-level objects by bundling and compressing them in a common stream.

      It can also increase the file size of documents that contain only a few such objects (due to the overhead of object stream creation).

      Content Compressors:
      To enhance the compression, that ticket originally contained a suggestion for "ContentCompressors", whose task would have been to recognize compressible structures and further reduce their size by applying a set of rules defined by the "CompressParameters".
      Those compressors were meant to be integrated into the flow of object stream creation/compression.

      As a starting point the following "ContentCompressors" were originally suggested:

      • "UnencodedStreamCompressor" - apply FLATE to previously uncompressed streams.
      • "ImageCompressor" - apply DCT compression to images, with a configurable quality and resolution.
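
      To make the first suggestion concrete, here is a minimal sketch of what an "UnencodedStreamCompressor" step could do to the raw bytes of a stream. It uses only java.util.zip and deliberately no PDFBox API; the class and method names (FlateSketch, deflate, compressIfSmaller) are hypothetical and not part of PDFBOX-4952. It keeps the deflated bytes only if they are actually smaller, since very short streams can grow under FLATE.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical helper, not PDFBox code: FLATE-compress raw stream bytes.
public class FlateSketch {

    /** Deflates the input in zlib format (the format /FlateDecode expects). */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int written = deflater.deflate(buffer);
            out.write(buffer, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    /**
     * Returns the deflated bytes only if they are smaller than the input;
     * otherwise returns the input array itself, signalling "leave as is".
     */
    static byte[] compressIfSmaller(byte[] raw) {
        byte[] packed = deflate(raw);
        return packed.length < raw.length ? packed : raw;
    }
}
```

      A caller would add the FlateDecode filter entry to the stream dictionary only when the returned array differs from the input.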

      In the POC both had some effect and drastically reduced the file size for some PDFs.
      But when I picked up the shelved code again, I was able to create a test case for image compression in which the file size exploded when translating a BMP ImageXObject to a DCT stream.
      Therefore I have to assume that those compressors were fit for a POC but do not work reliably and as expected.
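
      One possible guard against such size explosions, sketched under assumptions: encode the decoded image as JPEG with javax.imageio and keep the result only if it is smaller than the bytes already stored in the PDF. The names JpegGuardSketch, encodeJpeg and recompressIfSmaller are hypothetical, and how the decoded image and the existing encoded bytes are obtained from an ImageXObject is left out. This does not address quality loss, only size.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

// Hypothetical helper, not PDFBox code: JPEG re-encoding with a size check.
public class JpegGuardSketch {

    /** Encodes the image as JPEG with the given quality (0.0 to 1.0). */
    static byte[] encodeJpeg(BufferedImage image, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /**
     * Returns the JPEG bytes only if they are smaller than the existing
     * encoded stream; otherwise returns the existing bytes unchanged.
     */
    static byte[] recompressIfSmaller(byte[] existingEncoded, BufferedImage decoded,
                                      float quality) throws IOException {
        byte[] jpeg = encodeJpeg(decoded, quality);
        return jpeg.length < existingEncoded.length ? jpeg : existingEncoded;
    }
}
```

      A flat or bilevel image that is already FLATE-compressed to a few dozen bytes would be left untouched by this check, because the JPEG headers alone are larger than that.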

      The first use case that comes to mind is scanners. Some scanners create huge high-resolution images for scanned pages and bundle them into PDFs. The resulting large PDFs waste a lot of disk space for questionable benefit and also cause issues in further processing (e.g. sending them by mail with a limited attachment size).

      Such scans could benefit from a compression like this.
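
      For the "configurable resolution" part, a rough sketch of downsampling a scanned page image to a maximum DPI could look like this. It uses plain java.awt; DownsampleSketch, scaleFactor and downsample are hypothetical names, and the source DPI is assumed to be supplied by the caller (in a real PDF it would have to be derived from the image size and its placement on the page).

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not PDFBox code: limit an image to a maximum DPI.
public class DownsampleSketch {

    /** Factor to scale by; 1.0 means the image is already within the limit. */
    static double scaleFactor(double sourceDpi, double maxDpi) {
        return sourceDpi <= maxDpi ? 1.0 : maxDpi / sourceDpi;
    }

    /** Returns the source itself if no downsampling is needed. */
    static BufferedImage downsample(BufferedImage source, double sourceDpi, double maxDpi) {
        double factor = scaleFactor(sourceDpi, maxDpi);
        if (factor >= 1.0) {
            return source;
        }
        int width = Math.max(1, (int) Math.round(source.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(source.getHeight() * factor));
        BufferedImage target = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = target.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return target;
    }
}
```

      Downsampling a 600 dpi scan to 150 dpi reduces the pixel count to one sixteenth before any DCT or FLATE encoding is applied.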

      Step back:
      If you are still with me, let me ask some questions first:

      • If something like that is to be implemented, do these compressors make any sense?
      • Is it at all possible for the COSWriter to encounter uncompressed streams, or is this compressor entirely superfluous?
      • Is it a good idea to compress images, and do you have an idea how to solve this?
      • What else could be compressed?
      • Does this already exist for PDFBox, even if only as a partial solution?




            Assignee: Unassigned
            Reporter: Christian Appl (capSVD)