Bug 51664 - [PATCH] Tagged PDF performance improvement + tests
Summary: [PATCH] Tagged PDF performance improvement + tests
Status: CLOSED FIXED
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: pdf (show other bugs)
Version: 1.0
Hardware: PC Linux
: P2 normal
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-08-16 13:45 UTC by Mehdi Houshmand
Modified: 2012-03-28 06:36 UTC (History)
0 users



Attachments
patch (69.32 KB, patch)
2011-08-16 13:45 UTC, Mehdi Houshmand
Details | Diff
More tests (86.38 KB, patch)
2011-08-16 14:11 UTC, Mehdi Houshmand
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mehdi Houshmand 2011-08-16 13:45:31 UTC
This patch serves to address the slow performance of accessibility features in PDF creation. This is a collaborative effort between myself and Jeremias.
Comment 1 Mehdi Houshmand 2011-08-16 13:45:56 UTC
Created attachment 27391 [details]
patch
Comment 2 Mehdi Houshmand 2011-08-16 14:11:13 UTC
Created attachment 27392 [details]
More tests

This patch has been separated since it is purely unit tests.
Comment 3 Jeremias Maerki 2011-08-16 14:39:23 UTC
Just some background on the problem:

It was found that enabling accessibility (tagged PDF) decreases PDF production performance considerably.

I've profiled FOP with an FO file (about 10 pages). I ran both FO->PDF and FO->IF->PDF scenarios to isolate the bulk of the "lost" time. It turns out that the FO-IF stage doesn't lose a lot of performance due to the additional work. So I concentrated on IF->PDF.

The VisualVM profiler highlighted PDFDocument.getWriterFor() and BufferedOutputStream.flush() as hot spots in the accessibility case. Most of that is caused by PDFDictionary, PDFArray and PDFName. And the strong weight on these two is actually expected since Tagged PDF structures are all dictionaries and arrays. Lots of them.

Look at the PDF sizes:
- Normal PDF: 105 KB (65 PDF Objects)
- Tagged PDF: 868 KB (6462 PDF Objects)

That's A LOT of additional content. All dictionaries and arrays that cannot be compressed (in PDF 1.4). That also means a big increase in I/O output. So it's in nature of tagged PDF that it must be considerably slower.

What I've tried now is to address the hot spot I found above. I got rid of the Writers for encoding text output. Instead I switched to a StringBuilder that is flushed to the OutputStream when necessary. That decreases the average processing time after warm-up (IF->PDF case) from 775ms to 460ms (normal PDF from 355ms to 325ms). That is a speed-up of:

(460 - 325) / (775 - 355) = 135 / 420 = 0.32 = -68%
So it cuts the tagged PDF penalty to a third.

That was the IF->PDF case. Here are the measurements for the FO->PDF case (the same test document:

normal PDF: 772ms --> 712ms
tagged PDF: 1472ms --> 1042ms

normal PDF: 712 / 772 = 0.92 (-8%)
tagged PDF: 1042 / 1472 = 0.71 (-29%)
tagged PDF penalty: (1042 - 712) / (1472 - 772) = 330 / 700 = 0.47 (-53%)

There's a catch: This optimization requires a backwards-incompatible change in the PDF library. The PDFWritable interface changes from
void outputInline(OutputStream out, Writer writer) throws IOException;
to
void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException;

The same applies to PDFObject.formatObject(). Both are very central parts of the PDF library. It could invalidate pending patches or private additions from third-parties. But it doesn't seem to be easy enough to write adapter code to work around this.
Comment 4 Chris Bowditch 2011-10-05 09:55:37 UTC
Thanks for the patch Mehdi. Looking at [1] I realised that you haven't yet filed an ICLA. The ASF encourages all contributors to do so and it is a pre-requisite to becoming a committer. Since you have filed  a number of patches its about time you filed in the ICLA as described in [2]. Thanks

[1] http://people.apache.org/committer-index.html
[2] http://www.apache.org/licenses/
Comment 5 Mehdi Houshmand 2011-10-05 10:38:18 UTC
(In reply to comment #4)
> Thanks for the patch Mehdi. Looking at [1] I realised that you haven't yet
> filed an ICLA. The ASF encourages all contributors to do so and it is a
> pre-requisite to becoming a committer. Since you have filed  a number of
> patches its about time you filed in the ICLA as described in [2]. Thanks
> 
> [1] http://people.apache.org/committer-index.html
> [2] http://www.apache.org/licenses/

I have faxed the appropriate documents.
Comment 6 Chris Bowditch 2012-01-06 16:05:15 UTC
Hi Mehdi, Jeremias

Thanks for the patch. This has been committed in revision: 1228243. I had to adjust the unit tests to Junit 4, but otherwise the changes are fine.

Thanks,

Chris