Just some background on the problem:
It was found that enabling accessibility (tagged PDF) decreases PDF production performance considerably.
I've profiled FOP with an FO file (about 10 pages). I ran both FO->PDF and FO->IF->PDF scenarios to isolate the bulk of the "lost" time. It turns out that the FO-IF stage doesn't lose a lot of performance due to the additional work. So I concentrated on IF->PDF.
The VisualVM profiler highlighted PDFDocument.getWriterFor() and BufferedOutputStream.flush() as hot spots in the accessibility case. Most of that is caused by PDFDictionary, PDFArray and PDFName. And the strong weight on these two is actually expected since Tagged PDF structures are all dictionaries and arrays. Lots of them.
Look at the PDF sizes:
- Normal PDF: 105 KB (65 PDF Objects)
- Tagged PDF: 868 KB (6462 PDF Objects)
That's A LOT of additional content. All dictionaries and arrays that cannot be compressed (in PDF 1.4). That also means a big increase in I/O output. So it's in nature of tagged PDF that it must be considerably slower.
What I've tried now is to address the hot spot I found above. I got rid of the Writers for encoding text output. Instead I switched to a StringBuilder that is flushed to the OutputStream when necessary. That decreases the average processing time after warm-up (IF->PDF case) from 775ms to 460ms (normal PDF from 355ms to 325ms). That is a speed-up of:
(460 - 325) / (775 - 355) = 135 / 420 = 0.32 = -68%
So it cuts the tagged PDF penalty to a third.
That was the IF->PDF case. Here are the measurements for the FO->PDF case (the same test document:
normal PDF: 772ms --> 712ms
tagged PDF: 1472ms --> 1042ms
normal PDF: 712 / 772 = 0.92 (-8%)
tagged PDF: 1042 / 1472 = 0.71 (-29%)
tagged PDF penalty: (1042 - 712) / (1472 - 772) = 330 / 700 = 0.47 (-53%)
There's a catch: This optimization requires a backwards-incompatible change in the PDF library. The PDFWritable interface changes from
void outputInline(OutputStream out, Writer writer) throws IOException;
void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException;
The same applies to PDFObject.formatObject(). Both are very central parts of the PDF library. It could invalidate pending patches or private additions from third-parties. But it doesn't seem to be easy enough to write adapter code to work around this.