PDFBox / PDFBOX-4341

[Patch] PNGConverter: PNG bytes to PDImageXObject converter



    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.12
    • Fix Version/s: 2.0.18, 3.0.0 PDFBox
    • Component/s: Writing
    • Labels: None


      The attached patch implements a converter from PNG bytes to PDImageXObject. It tries to create a PDImageXObject from the chunks of a PNG image without recompressing it. This makes it possible to use programs like pngcrush and friends to embed optimally compressed images. It is also much faster than recompressing the image.

      The class PNGConverter does this in three steps:

      • Parsing the PNG chunk structure from the byte array
      • Validating all relevant data chunks (i.e. checking the CRC). Chunks which are not needed (e.g. text chunks) are not validated.
      • Constructing a PDImageXObject from the chunks
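
      The first two steps can be sketched roughly as follows. This is a minimal illustration using only the JDK, not the code from the patch: the class PngChunkSketch, its methods, and the set of skipped text chunk types are names and choices I made up for this example. It relies on the PNG chunk layout (4-byte big-endian length, 4-byte type, data, 4-byte CRC over type and data) and returns null on malformed input, mirroring the "bail out" behaviour described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration; not the PNGConverter class from the patch. */
final class PngChunkSketch {
    static final byte[] SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    // Assumption: text chunks are the ones that need no validation.
    static final Set<String> SKIPPED = new HashSet<>(Arrays.asList("tEXt", "zTXt", "iTXt"));

    static final class Chunk {
        final String type;   // four-letter chunk type, e.g. "IHDR"
        final int dataStart; // offset of the chunk data inside the source array
        final int length;    // number of data bytes
        final int storedCrc; // CRC value read from the file

        Chunk(String type, int dataStart, int length, int storedCrc) {
            this.type = type;
            this.dataStart = dataStart;
            this.length = length;
            this.storedCrc = storedCrc;
        }
    }

    /** Step 1: parse the chunk structure. Returns null if the bytes are not a well-formed PNG. */
    static List<Chunk> parse(byte[] png) {
        if (png.length < 8 || !Arrays.equals(Arrays.copyOf(png, 8), SIGNATURE)) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(png); // ByteBuffer is big-endian by default, like PNG
        buf.position(8);
        List<Chunk> chunks = new ArrayList<>();
        while (buf.remaining() >= 12) {
            int length = buf.getInt();
            // After the length field, type (4) + data + CRC (4) must still fit.
            if (length < 0 || length > buf.remaining() - 8) {
                return null;
            }
            byte[] typeBytes = new byte[4];
            buf.get(typeBytes);
            int dataStart = buf.position();
            buf.position(dataStart + length);
            int storedCrc = buf.getInt();
            String type = new String(typeBytes, StandardCharsets.US_ASCII);
            chunks.add(new Chunk(type, dataStart, length, storedCrc));
            if (type.equals("IEND")) {
                return chunks;
            }
        }
        return null; // ran out of bytes before IEND
    }

    /** The PNG CRC covers the chunk type and the chunk data, but not the length field. */
    static boolean crcOk(byte[] png, Chunk c) {
        CRC32 crc = new CRC32();
        crc.update(png, c.dataStart - 4, c.length + 4);
        return (int) crc.getValue() == c.storedCrc;
    }

    /** Step 2: validate every chunk except the skipped (text) ones. */
    static boolean relevantChunksValid(byte[] png, List<Chunk> chunks) {
        for (Chunk c : chunks) {
            if (!SKIPPED.contains(c.type) && !crcOk(png, c)) {
                return false;
            }
        }
        return true;
    }
}
```

      A caller would run parse(...) first, treat a null result or a false from relevantChunksValid(...) as "cannot convert", and only then look at the individual chunks.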

      If an error occurs at any of these steps, or the converter detects that it cannot map the image, it bails out and returns null. In that case the image has to be embedded the "normal" way, by reading it with ImageIO and compressing it again.

      Only these PNG image types can be converted (at least theoretically) without recompressing the image data:

      • Grayscale
      • Truecolor (i.e. RGB 8-Bit/16-Bit)
      • Indexed
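
      As a rough illustration of this check, the color type can be read from the IHDR chunk data (13 bytes: width, height, bit depth, color type, compression, filter, interlace). The class and method names below are hypothetical, and rejecting interlaced images is my own assumption for this sketch; the list above does not mention interlacing. The check only looks at IHDR, so an indexed image with an alpha table would still need separate handling.

```java
/** Hypothetical helper for illustration; not part of the patch. */
final class PngColorTypeSketch {
    // Color type codes as defined by the PNG specification.
    static final int GRAY = 0;
    static final int TRUECOLOR = 2;
    static final int INDEXED = 3;
    static final int GRAY_ALPHA = 4;
    static final int TRUECOLOR_ALPHA = 6;

    /**
     * @param ihdr the 13 data bytes of the IHDR chunk
     * @return true if the color type is one of the three types listed above
     */
    static boolean directlyEmbeddable(byte[] ihdr) {
        if (ihdr == null || ihdr.length != 13) {
            return false;
        }
        int colorType = ihdr[9] & 0xFF;  // byte 9: color type
        int interlace = ihdr[12] & 0xFF; // byte 12: interlace method
        if (interlace != 0) {
            return false; // assumption: interlaced data is left to the fallback path
        }
        return colorType == GRAY || colorType == TRUECOLOR || colorType == INDEXED;
    }
}
```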

      As soon as transparency is used it gets difficult:

      • Grayscale with alpha / truecolor with alpha: The alpha channel is stored in the image data stream, because the pixels are stored as (Gray,Alpha) or (Red,Green,Blue,Alpha) tuples. The alpha information would have to be separated out for the SMASK image. For now, such images can only be read and recompressed using the LosslessFactory.
      • Indexed with alpha: Alpha and color tables are separate in the PNG, so it should be possible to build the SMASK from the image data (which are just the table indices) and the alpha table. I tried that, but Acrobat Reader does not like indexed SMASKs… One could instead build a grayscale SMASK using the alpha table and the decompressed image index data. This would at least save some space, as the optimized indexed image data is still used.
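
      Both ideas can be sketched on plain byte arrays. This is only an illustration under simplifying assumptions: the sample data is already decompressed and unfiltered, samples are 8 bits wide, and indexed pixels use one byte per index. The class and method names are hypothetical and not taken from the patch. Palette entries beyond the end of the alpha table (the tRNS chunk) are treated as fully opaque, as the PNG specification defines.

```java
/** Hypothetical helper for illustration; not part of the patch. */
final class PngAlphaSketch {

    /**
     * Splits 8-bit (Red,Green,Blue,Alpha) tuples into color and alpha samples.
     * @return a two-element array: [0] = RGB samples, [1] = alpha samples
     */
    static byte[][] splitRgba(byte[] rgba) {
        if (rgba.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        int pixels = rgba.length / 4;
        byte[] rgb = new byte[pixels * 3];
        byte[] alpha = new byte[pixels];
        for (int p = 0; p < pixels; p++) {
            rgb[p * 3] = rgba[p * 4];
            rgb[p * 3 + 1] = rgba[p * 4 + 1];
            rgb[p * 3 + 2] = rgba[p * 4 + 2];
            alpha[p] = rgba[p * 4 + 3];
        }
        return new byte[][] {rgb, alpha};
    }

    /**
     * Builds 8-bit grayscale mask samples from palette indices and the alpha table.
     * @param indices one palette index per pixel
     * @param trns alpha table; may be shorter than the palette
     */
    static byte[] alphaFromIndices(byte[] indices, byte[] trns) {
        byte[] alpha = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            int index = indices[i] & 0xFF;
            // Entries missing from the alpha table are fully opaque.
            alpha[i] = index < trns.length ? trns[index] : (byte) 0xFF;
        }
        return alpha;
    }
}
```

      In both cases the resulting alpha samples would still have to be compressed and attached as a grayscale SMASK image, which this sketch leaves out.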

      With the current patch, only truecolor images without alpha work correctly. The other tests, for grayscale and indexed, fail. (To run the test drivers you must place the zipped images in the resources folder where png.png resides. These images are "original" work done by me using Gimp, Krita and ImageOptim (on macOS) to build the different PNG image types.)

      Notes for the current patch:

      • The grayscale images have the wrong gamma curve. I tried using the ColorSpace.CS_GRAY ICC profile and the image now seems only "slightly" off (e.g. pixel value FFD6D6D6 vs FFD7D7D7). As soon as a gAMA chunk is present, the image is tagged with a CalGray profile, but then the colors are much further off.
      • The cHRM (chroma) chunk is read and should work, as I used the formulas from the PDF spec to convert the cHRM values to the CalRGB whitepoint and matrix. I have not yet tested this, as I have no test image with cHRM at the moment. Note: Matrix(COSArray) and Matrix.toCOSArray() are fine for geometric matrices, but these methods are wrong for any other kind of matrix (e.g. color transform matrices), as they only store/restore 6 values of the 3x3 matrix. I deprecated PDCalRGB.setMatrix(Matrix) because of this: it never worked and cannot work as long as the Matrix class is for geometric use cases only. The Matrix class should also document that it is not general purpose. I added a PDCalRGB.setMatrix(COSArray) method that allows setting the matrix.
      • The indexed image displays fine in Acrobat Reader, but the test driver fails because PDImageXObject.getImage() returns a completely black (everything 0) image. Strange; I suspect some error in the PDFBox image decoding.
      • If an image is tagged with sRGB, the built-in Java sRGB ICC profile is attached. Theoretically you could use a CalRGB colorspace, but using an ICC color profile is likely faster (at least in PDFBox) and more "standard".
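
      For reference, the cHRM conversion mentioned above can be sketched like this. It follows the formulas given in the PDF specification's description of CalRGB, with R = G = B = 1, as far as I understand them. It is a standalone illustration and not the code from the patch: the class ChrmToCalRgbSketch and its members are hypothetical names. The cHRM chunk stores eight unsigned integers (white x/y, red x/y, green x/y, blue x/y), each scaled by 100000. Note that the result has all 9 matrix values, which is why a 6-value geometric matrix cannot hold it.

```java
/** Hypothetical helper for illustration; not part of the patch. */
final class ChrmToCalRgbSketch {

    static final class CalRgb {
        final double[] whitePoint; // XW YW ZW
        final double[] matrix;     // XA YA ZA XB YB ZB XC YC ZC

        CalRgb(double[] whitePoint, double[] matrix) {
            this.whitePoint = whitePoint;
            this.matrix = matrix;
        }
    }

    /** @param chrm the eight cHRM integers in file order: white, red, green, blue (x then y) */
    static CalRgb fromChrm(int[] chrm) {
        double xW = chrm[0] / 100000.0, yW = chrm[1] / 100000.0;
        double xR = chrm[2] / 100000.0, yR = chrm[3] / 100000.0;
        double xG = chrm[4] / 100000.0, yG = chrm[5] / 100000.0;
        double xB = chrm[6] / 100000.0, yB = chrm[7] / 100000.0;

        // Common denominator of the three luminance terms.
        double z = yW * ((xG - xB) * yR - (xR - xB) * yG + (xR - xG) * yB);

        double yA = yR * ((xG - xB) * yW - (xW - xB) * yG + (xW - xG) * yB) / z;
        double xA = yA * xR / yR;
        double zA = yA * ((1 - xR) / yR - 1);

        double yBm = -yG * ((xR - xB) * yW - (xW - xB) * yR + (xW - xR) * yB) / z;
        double xBm = yBm * xG / yG;
        double zBm = yBm * ((1 - xG) / yG - 1);

        double yC = yB * ((xR - xG) * yW - (xW - xG) * yR + (xW - xR) * yG) / z;
        double xC = yC * xB / yB;
        double zC = yC * ((1 - xB) / yB - 1);

        // With R = G = B = 1 the white point is the sum of the three columns.
        double[] whitePoint = {xA + xBm + xC, yA + yBm + yC, zA + zBm + zC};
        double[] matrix = {xA, yA, zA, xBm, yBm, zBm, xC, yC, zC};
        return new CalRgb(whitePoint, matrix);
    }
}
```

      Feeding in the sRGB chromaticities (31270, 32900, 64000, 33000, 30000, 60000, 15000, 6000) is a convenient sanity check, since the YA, YB and YC entries should then come out close to the familiar sRGB luminance weights.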

      You can also look at this patch on GitHub https://github.com/apache/pdfbox/compare/2.0...rototor:2.0-png-from-bytes-encoder?expand=1 if you like.

      It would be nice if someone could give me some hints on the colorspace problems. I will reread the specs; maybe I have missed something. But it would be great if someone else who knows about colorspaces could also take a look at this.

      As I have no idea how long it will take to understand why the colors are off for grayscale and wrong for indexed, I could prepare a stripped-down version of this patch which contains only the working parts (i.e. truecolor) and simply does nothing in the non-working cases. What do you think?


        1. pngconvert_v3.patch
          224 kB
          Emmeran Seehuber
        2. optimized.zip
          313 kB
          Emmeran Seehuber
        3. image-2018-10-25-09-29-47-251.png
          132 kB
          Emmeran Seehuber
        4. 017063.png
          13 kB
          Tilman Hausherr
        5. 017012.png
          69 kB
          Tilman Hausherr
        6. 017030.png
          6 kB
          Tilman Hausherr
        7. 008528.png
          231 kB
          Tilman Hausherr
        8. 014431.png
          13 kB
          Tilman Hausherr
        9. 017084.png
          65 kB
          Tilman Hausherr
        10. 016289.png
          6 kB
          Tilman Hausherr
        11. 001230.png
          9 kB
          Tilman Hausherr
        12. 001229.png
          10 kB
          Tilman Hausherr
        13. pngconvert_v2.patch
          52 kB
          Emmeran Seehuber
        14. pngconvert_testimg.zip
          136 kB
          Emmeran Seehuber
        15. pngconvert_v1.patch
          34 kB
          Emmeran Seehuber



            Assignee: Tilman Hausherr (tilman)
            Reporter: Emmeran Seehuber (rototor)