I'm currently working on this, so I wanted to open an issue to let everyone know.
Color spaces need to be refactored in 2.0.0. Tilman noticed slowness in PDFBOX-1851 caused by using ICC profiles and calling ColorSpace#toRGB for every pixel. For example, the file from PDFBOX-1851 went from rendering in 4 seconds to taking over 60 seconds.
The solution is to use ColorConvertOp to convert an entire BufferedImage in one go, taking advantage of AWT's native color management module. Color conversions done this way are almost instantaneous, even for large images.
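As a rough illustration of this approach, the sketch below converts a whole BufferedImage to sRGB with a single ColorConvertOp call. It uses only standard AWT classes. The class and method names (BulkColorConvert, toSRGB) are made up for this example and are not PDFBox code.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ColorConvertOp;

public class BulkColorConvert {

    /**
     * Converts the whole image to sRGB in one ColorConvertOp call.
     * Hypothetical helper for illustration only, not part of PDFBox.
     */
    public static BufferedImage toSRGB(BufferedImage src) {
        // TYPE_INT_RGB images use an sRGB color model by default.
        BufferedImage dest = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        // With no explicit color spaces, the op converts from the color
        // space of the source image to that of the destination image.
        new ColorConvertOp(null).filter(src, dest);
        return dest;
    }

    public static void main(String[] args) {
        // A gray image stands in for an image in a non-RGB color space.
        BufferedImage gray = new BufferedImage(64, 64, BufferedImage.TYPE_BYTE_GRAY);
        BufferedImage rgb = toSRGB(gray);
        System.out.println(rgb.getWidth() + "x" + rgb.getHeight()
                + " sRGB=" + rgb.getColorModel().getColorSpace().isCS_sRGB());
    }
}
```

The per-pixel alternative would call ColorSpace#toRGB once for every pixel, which is the pattern that caused the slowdown described above.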
The current design of color spaces within PDFBox depends on conversions being done per pixel. A significant refactoring is therefore needed to convert images with ColorConvertOp without falling back to per-pixel calls in cases such as a Separation color space that uses a CMYK alternate color space via a tint transform.
The color space handling code is also tightly coupled to image handling. The various classes which read images each have their own color handling code, which relies on per-pixel conversions. For this reason, any color space refactoring must also include a significant refactoring of the image handling code. This is an opportunity to refactor all color handling so that it is encapsulated within the color space classes, allowing downstream users to call toRGB(float) or toRGB(BufferedImage) without needing to worry about tint transforms and the like.
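To make the encapsulation idea concrete, here is a minimal sketch of a Separation-like color space that hides its tint transform behind a single toRGB call. Every name in it (ColorSpaceSketch, SketchColorSpace, NaiveCMYK, SketchSeparation) is hypothetical, and the CMYK conversion is a naive arithmetic one rather than an ICC-based one. It shows the shape of the design, not the actual PDFBox classes.

```java
import java.util.Arrays;
import java.util.function.Function;

public class ColorSpaceSketch {

    /** Hypothetical color space that hides its conversion details. */
    public interface SketchColorSpace {
        float[] toRGB(float[] value);
    }

    /** Naive CMYK to RGB conversion (no ICC profile), for illustration only. */
    public static final class NaiveCMYK implements SketchColorSpace {
        @Override
        public float[] toRGB(float[] v) {
            float k = 1 - v[3];
            return new float[] { (1 - v[0]) * k, (1 - v[1]) * k, (1 - v[2]) * k };
        }
    }

    /** Separation-like space: one tint value mapped into an alternate space. */
    public static final class SketchSeparation implements SketchColorSpace {
        private final SketchColorSpace alternate;
        private final Function<float[], float[]> tintTransform;

        public SketchSeparation(SketchColorSpace alternate,
                                Function<float[], float[]> tintTransform) {
            this.alternate = alternate;
            this.tintTransform = tintTransform;
        }

        @Override
        public float[] toRGB(float[] tint) {
            // The caller never sees the tint transform or the alternate space.
            return alternate.toRGB(tintTransform.apply(tint));
        }
    }

    public static void main(String[] args) {
        // A "black ink" separation: the tint maps straight to the K channel.
        SketchColorSpace black = new SketchSeparation(
                new NaiveCMYK(), t -> new float[] { 0, 0, 0, t[0] });
        System.out.println(Arrays.toString(black.toRGB(new float[] { 0f })));
        System.out.println(Arrays.toString(black.toRGB(new float[] { 1f })));
    }
}
```

A caller only ever deals with the SketchColorSpace interface, so the same call works whether the underlying space is a device space or one defined through a tint transform.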
Here's a summary of the changes:
- PDCcitt has been removed; its reading capability has moved to CCITTFaxFilter and its writing capability has moved to CCITTFactory.
- PDJpeg has been removed. JPEG reading is now done by new code in DCTFilter which correctly handles CMYK/YCCK color. This fixes various files where images appeared like negatives. JPEG writing is done by new code in JPEGFactory.
- cleaned up JBIG2Filter
- cleaned up JPXFilter. In particular, calling decode() caused the stream dictionary to be updated, which was unsafe. I've also added a special JPXColorSpace which wraps the embedded AWT color space of a JPX BufferedImage; this replaces the awkward mapping of ColorSpace to PDColorSpace.
- Added better error messages for missing JAI plugins (JPX, JBIG2). A special exception, MissingImageReaderException, is now thrown.
- PDXObjectForm has been renamed to PDFormXObject to match the PDF spec.
- PDXObjectImage has been renamed to PDImageXObject in the same manner.
- PDInlinedImage has been renamed to PDInlineImage for the same reason.
- CCITTFaxDecodeFilter has been renamed to CCITTFaxFilter for consistency with the other filters.
- ImageParameters has been removed. It was used to represent inline image parameters, which are now simply members of PDInlineImage.
- added PDColor, which represents a color value, including patterns. It is immutable for ease of use.
- removed PDColorState, which was a container for both a color and a color space. In almost every case it was used to represent a color, so it has been replaced by PDColor and occasionally PDColorSpace.
- moved most of the functionality of PDXObject into its subclasses
- rewrote almost all color handling code in all PDColorSpace subclasses, including fixing the calculations for L*a*b*, DeviceN, and Indexed color spaces.
- all color spaces now implement a toRGB(float) function for color conversion, so external consumers of color spaces no longer have to know about internals such as tint transforms.
- image color conversion is now performed in one operation, using ColorConvertOp, rather than pixel by pixel. This speeds up ICC transforms by many orders of magnitude. Color spaces now expose a special method toImageRGB(Raster) for this purpose. This fixes some known performance issues with certain files.
- updated Type1, Axial, Radial, and Gouraud shading contexts to call the new toRGB functions. This is an interim measure; for better performance, the color conversion should instead be done using toImageRGB after the entire gradient is drawn to the raster.
- creation of AWT Paint has been moved inside color spaces, hiding the details from the caller. It is no longer possible to get an AWT Color from a color space; only a Paint may be obtained.
- removed PDColorSpaceFactory and moved its functionality into PDColorSpace.
- moved some of the new shading and tiling pattern code to PDPattern so that toPaint() is encapsulated in the color space.
- new PDImage interface which is implemented by both PDInlineImage and PDImageXObject
- Image XObject image reading, masking and stencilling code has been rewritten, resulting in the removal of CompositeImage.
- new SampledImageReader performs image reading for all formats, including JPEG and CCITT. The format itself is simply a filter, as is the case in the PDF spec. New image reading handles decode arrays, interpolation, and conversion of all image types to efficient 8bpp rasters. This replaces PDPixelMap as well as reading code from PDJpeg and PDCcitt. Handling of decode arrays fixes various issues where images were inverted, especially inline images in Type 3 fonts.
- removed SetNonStrokingICCBasedColor, SetNonStrokingIndexed, SetNonStrokingPattern, SetNonStrokingSeparation, SetStrokingICCBasedColor, SetStrokingIndexed, SetStrokingPattern, SetStrokingSeparation, and replaced them with SetColor.
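
For readers unfamiliar with the decode arrays mentioned in the SampledImageReader item above, the following sketch shows why ignoring them can invert an image. It assumes the linear mapping from the PDF spec, where a raw sample is mapped onto the range [Dmin, Dmax]. The class and method names (DecodeArraySketch, decode, to8Bit) are invented for this example and this is not the actual SampledImageReader code.

```java
public class DecodeArraySketch {

    /**
     * Maps a raw sample in [0, 2^bitsPerComponent - 1] onto [dMin, dMax].
     * Hypothetical helper for illustration only.
     */
    public static float decode(int sample, int bitsPerComponent, float dMin, float dMax) {
        int maxSample = (1 << bitsPerComponent) - 1;
        return dMin + sample * (dMax - dMin) / maxSample;
    }

    /** Scales a decoded value in [0, 1] to an 8-bit sample, clamping out-of-range input. */
    public static int to8Bit(float value) {
        float clamped = Math.max(0f, Math.min(1f, value));
        return Math.round(clamped * 255f);
    }

    public static void main(String[] args) {
        // 1-bit image with Decode [1 0]: raw 0 becomes white, raw 1 becomes black.
        System.out.println(to8Bit(decode(0, 1, 1f, 0f)));
        System.out.println(to8Bit(decode(1, 1, 1f, 0f)));
        // The same samples with the default Decode [0 1] come out the other way round.
        System.out.println(to8Bit(decode(0, 1, 0f, 1f)));
        System.out.println(to8Bit(decode(1, 1, 0f, 1f)));
    }
}
```

A reader that always assumes the default [0 1] range would render an image with Decode [1 0] as a negative, which matches the inverted images described above.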