Note: This issue is based on a discussion which occurred regarding
PDFBOX-2301 but is actually a separate issue.
Currently we cache the page resources in PDResources which belongs to a specific PDPage. This causes two problems, 1) users who want to hold many PDPage objects in memory will have high memory use (but this is often by accident*). 2) By caching resources in PDPage we only get to keep that cache for the lifetime of the page, which e.g. in PDFRenderer is a single page only. That means that a font which appears on 40 pages has to be parsed 40 times, which causes slow running times, but also memory thrashing as objects are destroyed frequently only to be re-created.
What PDFRenderer really needs is not page-wide caching but document-wide caching, so that it can cache fonts, cmaps, color profiles, etc. only once. But that won't work for images, because they're too large. What we're beginning to realise is that caching is use-case specific and probably shouldn't be built-in to PDFBox's pdmodel. Instead we should removing resource caching from PDPage/PDResources and implement custom caching in PDFRenderer and other downstream classes such as PDFTextStripper. I'll happily volunteer myself. The existing high-level PDFBox APIs will continue to "just work" and power users will get a level of control that they appreciate.
This strategy could be enhanced by removing memory-hungry methods on PDResources such as getFonts() and getXObjects() which force all resources of a particular type to be loaded, whether or not they are needed, or actually used in the content stream. They would be replaced by methods to retrieve a single resource, e.g. getFont(name).
* There probably isn't a legitimate use case for 1) any more, we've solved the issues which we used to have with image caching (in fact, the clearCache() method actually no longer needs to be called by PDFRenderer, though it currently is). The real problem is that it's easy to accidentally retain PDPage objects, the PDDocument#getDocumentCatalog().getAllPages() method is dangerous as looping over it will cause pages to be retained during processing, like so:
I added of couple of methods a while ago to avoid this by fetching each PDPage one at a time, and this is now used internally in PDFBox to avoid the memory problems we used to have:
To solve this problem, we could change getAllPages() so that instead of returning a List it returns an Iterator<PDPage>, which would provide a nicer API than getPage(int) and most existing code will continue to work. This is also an opportunity to also fix type safety issues due to PDPageNode and incorrect handling of the page tree (this is similar to the issue we had recently with the acroform field tree).