[PDFBOX-1824] [PATCH] CFF fonts render wrong glyphs - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: None
Labels:
- patch

Description

I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.

Problem 1
---------

Look at line 7 of the poem: it should read "And the mouldering dust that years have made"
but instead reads "Afld the fioulderiflg dust that years have fiade".

The CFF font is assumed to use CIDs, but it does not if it is not a ROS (CID-keyed) font.
Therefore we add a check for the CFF ROS class.
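
A minimal sketch of the kind of check described above, not the actual patch. The classes CffFontSketch, CffRosFontSketch and GlyphSelector are hypothetical stand-ins, not PDFBox or FontBox types. It assumes a plain CFF font selects glyphs by name through its built-in encoding, while a ROS font selects them by CID.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for a parsed, name-keyed CFF font (not a PDFBox class).
class CffFontSketch {
    final Map<Integer, String> codeToName; // built-in encoding: code -> glyph name

    CffFontSketch(Map<Integer, String> codeToName) {
        this.codeToName = codeToName;
    }
}

// Hypothetical stand-in for a CID-keyed CFF font, i.e. one that carries a ROS entry.
class CffRosFontSketch extends CffFontSketch {
    CffRosFontSketch() {
        super(Collections.<Integer, String>emptyMap());
    }
}

final class GlyphSelector {
    // Returns a key identifying the glyph to draw for a character code.
    static String glyphKey(CffFontSketch font, int code) {
        if (font instanceof CffRosFontSketch) {
            // Only a ROS font is CID-keyed, so only here is the code treated as a CID.
            return "cid:" + code;
        }
        // Otherwise fall back to the glyph name from the built-in encoding.
        String name = font.codeToName.get(code);
        return name != null ? name : ".notdef";
    }
}
```

With this shape, a non-ROS font never has its codes interpreted as CIDs, which is the behaviour the problem above calls for.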

Patch 1 fixes this.

Problem 2
---------

Look at line 3: "of right shoice" should be "of right choice".
Likewise, on line 2 of the 2nd paragraph, "And a staunsh" should be "And a staunch".
The st and ch ligatures are incorrect.

This is because the font is a CFF ROS CID font and the glyphs for the st and ch ligatures
both have no name. The CFF format achieves this by using SIDs beyond the size of the string
index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.

Unfortunately, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
assumption appears throughout the codebase. Because a glyph name and a SID perform essentially
the same role, I recommend a simple solution to the problem: when a SID beyond the size of
the string index is encountered, instead of mapping it to .notdef it should be mapped to
a new name with the prefix "SID", for example mapping SID 409 to the name "SID409". That way
each glyph will have a unique name, which is what PDFBox assumes.
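
The naming rule proposed above can be sketched as follows. This is an illustration only, not the contents of the patch: the SidNames class and its nameForSid method are hypothetical, and treating all resolvable strings as a single list indexed by SID is a simplifying assumption.

```java
import java.util.List;

final class SidNames {
    // Resolves a SID to a glyph name. The strings list is assumed to hold every
    // string the font can resolve, indexed directly by SID (a simplification).
    static String nameForSid(int sid, List<String> strings) {
        if (sid >= 0 && sid < strings.size()) {
            return strings.get(sid);
        }
        // Beyond the string index: synthesize a unique name instead of ".notdef".
        return "SID" + sid;
    }
}
```

Because the SID is embedded in the synthesized name, two unnamed glyphs such as the st and ch ligatures receive different names, so name-keyed lookups no longer collapse them into one.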

Patch 2 fixes this.

Problem 3
---------

Look at line 2, "That creepeth oÉer ruins old!" the word "o'er" is incorrectly rendered
as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
base encoding to "quoteright", but this is being ignored by PDFBox.

In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
"quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
codeToGlyph map. This is correct.

However, later when the "Eacute" glyph is encountered, its built-in charset code is also
201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
code 201 being associated with the "Eacute" glyph.

The solution is to build the codeToGlyph map in a strict order: first populate it with the
font's built-in charset, then the PDF Encoding overwrites any entries which it defines.
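
The strict ordering described above can be sketched as below. This is a hedged illustration rather than the patch itself: CodeToGlyphSketch is a hypothetical class, glyphs are represented by their names only, and both inputs are assumed to be already available as code-to-name maps.

```java
import java.util.HashMap;
import java.util.Map;

final class CodeToGlyphSketch {
    // Builds the code -> glyph map in two ordered passes.
    static Map<Integer, String> build(Map<Integer, String> builtInCharset,
                                      Map<Integer, String> pdfEncoding) {
        // Pass 1: start from the font's built-in assignments.
        Map<Integer, String> codeToGlyph = new HashMap<>(builtInCharset);
        // Pass 2: the PDF Encoding overwrites every code it defines.
        codeToGlyph.putAll(pdfEncoding);
        return codeToGlyph;
    }
}
```

With the built-in entry 201 -> "Eacute" and the PDF Encoding entry 201 -> "quoteright", the second pass always wins, regardless of the order in which glyphs appear in the font.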

Patch 3 fixes this (and also replaces patch 2).

Attachments

- 1.patch (02/Jan/14 08:34, 1 kB, John Hewson)
- 2.patch (02/Jan/14 08:34, 0.5 kB, John Hewson)
- 3.patch (02/Jan/14 08:34, 2 kB, John Hewson)
- all.patch (02/Jan/14 08:34, 3 kB, John Hewson)
- Bimbo_Historia_20070409_Esp.pdf-2-rev-1554775.png (02/Jan/14 22:02, 1.01 MB, Tilman Hausherr)
- Bimbo_Historia_20070409_Esp.pdf-2-rev-current.png (02/Jan/14 22:02, 933 kB, Tilman Hausherr)
- bimbo_historia.patch (03/Jan/14 18:55, 3 kB, John Hewson)
- bimbo_historia-patched.jpg (03/Jan/14 18:59, 382 kB, John Hewson)
- calluna-11.pdf (02/Jan/14 08:35, 146 kB, John Hewson)
- patched.jpg (02/Jan/14 08:36, 219 kB, John Hewson)
- trunk.jpg (02/Jan/14 08:36, 268 kB, John Hewson)

Issue Links

is duplicated by

PDFBOX-1697 the text show Incorrect

Closed

People

Assignee: Andreas Lehmkühler

Reporter: John Hewson

Votes: 0

Watchers: 3

Dates

Created: 02/Jan/14 08:29

Updated: 17/Mar/16 19:08

Resolved: 04/Jan/14 13:39