FOP-1958

[PATCH] Tagged PDF performance improvement + tests

    Details

    • Type: Bug
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: None
    • Component/s: renderer/pdf
    • Labels:
      None
    • Environment:
      Operating System: Linux
      Platform: PC
    • External issue ID:
      51664

      Description

      This patch addresses the slow performance of the accessibility features (tagged PDF) in PDF creation. It is a collaborative effort between Jeremias and me.

      1. moretests.patch
        86 kB
        Mehdi Houshmand
      2. performanceandtests.patch
        69 kB
        Mehdi Houshmand

        Activity

        Chris Bowditch added a comment -

        Hi Mehdi, Jeremias

        Thanks for the patch. This has been committed in revision 1228243. I had to adjust the unit tests to JUnit 4, but otherwise the changes are fine.

        Thanks,

        Chris

        Mehdi Houshmand added a comment -

        (In reply to comment #4)
        > Thanks for the patch, Mehdi. Looking at [1] I realised that you haven't yet
        > filed an ICLA. The ASF encourages all contributors to do so and it is a
        > prerequisite to becoming a committer. Since you have filed a number of
        > patches, it's about time you filed the ICLA as described in [2]. Thanks
        >
        > [1] http://people.apache.org/committer-index.html
        > [2] http://www.apache.org/licenses/

        I have faxed the appropriate documents.

        Chris Bowditch added a comment -

        Thanks for the patch, Mehdi. Looking at [1] I realised that you haven't yet filed an ICLA. The ASF encourages all contributors to do so and it is a prerequisite to becoming a committer. Since you have filed a number of patches, it's about time you filed the ICLA as described in [2]. Thanks

        [1] http://people.apache.org/committer-index.html
        [2] http://www.apache.org/licenses/

        Jeremias Maerki added a comment -

        Just some background on the problem:

        It was found that enabling accessibility (tagged PDF) decreases PDF production performance considerably.

        I've profiled FOP with an FO file (about 10 pages). I ran both FO->PDF and FO->IF->PDF scenarios to isolate the bulk of the "lost" time. It turns out that the FO->IF stage doesn't lose a lot of performance due to the additional work. So I concentrated on IF->PDF.

        The VisualVM profiler highlighted PDFDocument.getWriterFor() and BufferedOutputStream.flush() as hot spots in the accessibility case. Most of that is caused by PDFDictionary, PDFArray and PDFName. And the strong weight on these two is actually expected since Tagged PDF structures are all dictionaries and arrays. Lots of them.

        Look at the PDF sizes:

        • Normal PDF: 105 KB (65 PDF Objects)
        • Tagged PDF: 868 KB (6462 PDF Objects)

        That's A LOT of additional content: all dictionaries and arrays that cannot be compressed (in PDF 1.4). That also means a big increase in I/O output. So it's in the nature of tagged PDF that it must be considerably slower.

        What I've tried now is to address the hot spot I found above. I got rid of the Writers for encoding text output. Instead I switched to a StringBuilder that is flushed to the OutputStream when necessary. That decreases the average processing time after warm-up (IF->PDF case) from 775ms to 460ms for tagged PDF (normal PDF from 355ms to 325ms). The tagged PDF penalty, i.e. the extra time over normal PDF, shrinks to:

        (460 - 325) / (775 - 355) = 135 / 420 = 0.32 (-68%)
        So it cuts the tagged PDF penalty to a third.

        That was the IF->PDF case. Here are the measurements for the FO->PDF case (the same test document):

        normal PDF: 772ms --> 712ms
        tagged PDF: 1472ms --> 1042ms

        normal PDF: 712 / 772 = 0.92 (-8%)
        tagged PDF: 1042 / 1472 = 0.71 (-29%)
        tagged PDF penalty: (1042 - 712) / (1472 - 772) = 330 / 700 = 0.47 (-53%)
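
        The penalty figures above follow one formula: the extra time tagging costs after the change, divided by the extra time it cost before. A minimal Java sketch of that arithmetic, using only the timings quoted in this comment (the class and method names are made up for illustration and are not part of FOP):

```java
import java.util.Locale;

/** Illustrative helper, not FOP code: recomputes the tagged PDF penalty ratios. */
final class TaggedPdfPenalty {

    /** Extra time in ms that tagging adds on top of a normal PDF run. */
    static double penalty(double taggedMs, double normalMs) {
        return taggedMs - normalMs;
    }

    /** Penalty after the change divided by the penalty before it. */
    static double penaltyRatio(double taggedBefore, double normalBefore,
                               double taggedAfter, double normalAfter) {
        return penalty(taggedAfter, normalAfter) / penalty(taggedBefore, normalBefore);
    }

    /** Formats a ratio with two decimals, independent of the default locale. */
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.2f", ratio);
    }

    public static void main(String[] args) {
        // IF->PDF: tagged 775ms -> 460ms, normal 355ms -> 325ms
        System.out.println("IF->PDF penalty ratio: " + format(penaltyRatio(775, 355, 460, 325)));
        // FO->PDF: tagged 1472ms -> 1042ms, normal 772ms -> 712ms
        System.out.println("FO->PDF penalty ratio: " + format(penaltyRatio(1472, 772, 1042, 712)));
    }
}
```

        A ratio of 0.32 means the penalty is about a third of what it was (-68%); 0.47 corresponds to -53%.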

        There's a catch: This optimization requires a backwards-incompatible change in the PDF library. The PDFWritable interface changes from
        void outputInline(OutputStream out, Writer writer) throws IOException;
        to
        void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException;

        The same applies to PDFObject.formatObject(). Both are very central parts of the PDF library. It could invalidate pending patches or private additions from third parties. But it doesn't seem to be easy enough to write adapter code to work around this.
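
        To make the idea concrete, here is a rough sketch of how text output can be collected in a shared StringBuilder and written to the OutputStream only when necessary. Only the outputInline signature is taken from this comment. The interface name InlineWritable, the flush helper, the sample objects and the ISO-8859-1 encoding are assumptions for illustration and may differ from what the patch does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for PDFWritable; only the method signature comes from the comment. */
interface InlineWritable {
    void outputInline(OutputStream out, StringBuilder textBuffer) throws IOException;
}

final class TextBufferSketch {

    /** Writes pending text to the stream and empties the buffer (illustrative helper). */
    static void flush(StringBuilder textBuffer, OutputStream out) throws IOException {
        if (textBuffer.length() > 0) {
            // Assumption: the text is encoded as ISO-8859-1.
            out.write(textBuffer.toString().getBytes(StandardCharsets.ISO_8859_1));
            textBuffer.setLength(0);
        }
    }

    /** A text-only object just appends to the buffer: no Writer, no per-object flush. */
    static InlineWritable name(String name) {
        return (out, textBuffer) -> textBuffer.append('/').append(name);
    }

    /** An object that writes raw bytes flushes pending text first to keep the byte order. */
    static InlineWritable binary(byte[] payload) {
        return (out, textBuffer) -> {
            flush(textBuffer, out);
            out.write(payload);
        };
    }

    /** Emits two names and one binary chunk, then returns everything that reached the stream. */
    static String render() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        StringBuilder textBuffer = new StringBuilder();
        name("Type").outputInline(out, textBuffer);
        textBuffer.append(' ');
        name("StructElem").outputInline(out, textBuffer);
        textBuffer.append('\n');
        binary(new byte[] {'<', '0', '1', '>'}).outputInline(out, textBuffer);
        flush(textBuffer, out); // final flush so no text is left behind
        return out.toString(StandardCharsets.ISO_8859_1.name());
    }

    public static void main(String[] args) throws IOException {
        System.out.print(render());
    }
}
```

        In this sketch the many small dictionary, array and name objects only append characters; the stream is touched when raw bytes have to be written or at the end.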

        Mehdi Houshmand added a comment -

        Attachment moretests.patch has been added with description: More tests

        Mehdi Houshmand added a comment -

        This patch has been separated since it is purely unit tests.

        Mehdi Houshmand added a comment -

        Attachment performanceandtests.patch has been added with description: patch


          People

          • Assignee:
            fop-dev
          • Reporter:
            Mehdi Houshmand