[PDFBOX-610] Fonts should not be cached by PDFStreamEngine - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.0-incubator
Fix Version/s: 1.7.0
Component/s: Text extraction
Labels:
- PDFStreamEngine
- fontwidth
Environment: Win or Linux

Description

org.apache.pdfbox.util.PDFStreamEngine
Fonts are cached in the field 'private Map documentFontCache = new HashMap();',
which is used in method 'processSubStream()' by the call 'sr.fonts = resources.getFonts(documentFontCache);'.

The problem is that a PDF document can define a font with a limited 'firstChar' to 'lastChar' range (maybe just the space character) and then expand that range at a later point within the same page. When the font is cached, those updates are ignored.

In particular, test page 1 of http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf.
With font caching, the widths of the characters in the upper right corner of the page are reported as zero, and text extraction and text merging are compromised.
Without font caching, the widths are correct. Other documents cause the same problem.
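
A minimal sketch of the failure mode, for illustration only. It does not use PDFBox classes: 'FontDict', 'Font' and 'getFont' are hypothetical stand-ins, and the only thing borrowed from the real code is the shape of the call, where passing a cache reuses an earlier font object and passing null rebuilds it from the current definition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class FontCacheDemo {

    /** Hypothetical stand-in for a font dictionary: widths exist only for firstChar..lastChar. */
    static final class FontDict {
        final int firstChar;
        final int lastChar;
        final float[] widths;

        FontDict(int firstChar, int lastChar, float[] widths) {
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }
    }

    /** Hypothetical stand-in for a font object built once from a dictionary. */
    static final class Font {
        private final FontDict dict;

        Font(FontDict dict) {
            this.dict = dict;
        }

        /** Returns 0 for any character code outside the declared range. */
        float widthOf(int code) {
            if (code < dict.firstChar || code > dict.lastChar) {
                return 0f;
            }
            return dict.widths[code - dict.firstChar];
        }
    }

    /** Assumed to mirror the shape of getFonts(cache): a null cache means "always rebuild". */
    static Font getFont(String name, FontDict dict, Map<String, Font> cache) {
        if (cache == null) {
            return new Font(dict);
        }
        // On a cache hit the dictionary passed in is never looked at.
        return cache.computeIfAbsent(name, key -> new Font(dict));
    }

    public static void main(String[] args) {
        // First definition of "F1" covers only the space character (code 32).
        FontDict narrow = new FontDict(32, 32, new float[] {250f});

        // A later definition under the same name covers codes 32..65 ('A' is 65).
        float[] wideWidths = new float[34];
        Arrays.fill(wideWidths, 500f);
        FontDict wide = new FontDict(32, 65, wideWidths);

        Map<String, Font> cache = new HashMap<>();
        getFont("F1", narrow, cache);             // populates the cache
        Font stale = getFont("F1", wide, cache);  // cache hit: the wider range is ignored
        Font fresh = getFont("F1", wide, null);   // no cache: built from the current definition

        System.out.println("cached width of 'A':   " + stale.widthOf('A'));
        System.out.println("uncached width of 'A': " + fresh.widthOf('A'));
    }
}
```

In this sketch the cached lookup reports a zero width for 'A' because the cached object still holds the first, narrower range, while the uncached lookup reports the width declared by the later definition.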

To fix the problem, change the call in method 'processSubStream()' to:
sr.fonts = resources.getFonts(null);

There was some effort put into font caching. Unfortunately, it should not be used on unknown documents.

Issue Links

is related to: PDFBOX-3024 Preflight validation call PDType0Font.clear at the wrong time (Closed)

People

Assignee: Andreas Lehmkühler

Reporter: Peter Costello

Votes: 0

Watchers: 0

Dates

Created: 07/Feb/10 06:06

Updated: 11/Feb/16 16:56

Resolved: 22/Jan/12 13:25
