[PDFBOX-2370] Move caching outside of PDResources - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: PDModel
Labels:
None

Description

Note: This issue is based on a discussion which occurred regarding ~~PDFBOX-2301~~ but is actually a separate issue.

Currently we cache the page resources in PDResources which belongs to a specific PDPage. This causes two problems, 1) users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*). 2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created.

What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large. What we're beginning to realise is that caching is use-case specific and probably shouldn't be built-in to PDFBox's pdmodel. Instead we should removing resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to "just work" and power users will get a level of control that they appreciate.

This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects() which force all resources of a particular type to be loaded, whether or not they are needed, or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).

—

* There probably isn't a legitimate use case for 1) any more, we've solved the issues which we used to have with image caching (in fact, the clearCache() method actually no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() method is dangerous as looping over it will cause pages to be retained during processing, like so:

for (PDPage page : document.getDocumentCatalog().getAllPages()) // java.util.List
{
     // ... this is idiomatic in PDFBox 1.8
} 
// List returned by getAllPages() kept in scope until here (bad)

I added of couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:

for (int i = 0; i < document.getNumberOfPages(); i++)
{
    PDPage page = document.getPage(i);
    // ... this is the new 2.0 way
    // current page falls out of scope here (good)
}

To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int) and most existing code will continue to work. This is also an opportunity to also fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

PDFBOX-2370-002701.pdf
17/Sep/15 16:32
866 kB
Tilman Hausherr

Issue Links

is depended upon by

PDFBOX-2587 PDF takes minutes to convert (sRGB)

Closed

is related to

PDFBOX-2856 Markedly slower processing for particular file in 2.0.0-trunk vs 1.8.9

Closed

PDFBOX-3484 Implement some caching of PDImageXObject

Closed

PDFBOX-2727 Cache color space instances

Closed

relates to

PDFBOX-2323 More flexible image caching (OOM)

Closed

Activity

People

Assignee:: John Hewson

Reporter:: John Hewson

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 20/Sep/14 02:06

Updated:: 08/Sep/16 17:09

Resolved:: 29/Sep/15 00:35