  1. PDFBox
  2. PDFBOX-1824

[PATCH] CFF fonts render wrong glyphs


    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels:


      I've found three very closely related CFF encoding issues in v2.0.0 when using PDFToImage.

      Problem 1

      Look at line 7 of the poem: it should read "And the mouldering dust that years have made"
      but instead reads "Afld the fioulderiflg dust that years have fiade".

      The CFF font is assumed to use CIDs, but it does not unless it is a ROS font.
      Therefore we add a check for the CFF ROS class.

      Patch 1 fixes this.
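
      The kind of check described above might look like the following minimal sketch. The class
      names SimpleCffFont and RosCffFont and the method usesCids are hypothetical stand-ins for
      illustration only; they are not PDFBox's actual classes and this is not the patch itself.

```java
/** Illustration only: hypothetical stand-ins, not PDFBox's actual classes. */
class CidCheckSketch {
    /** A plain, name-keyed CFF font (hypothetical type). */
    static class SimpleCffFont { }

    /** A CID-keyed CFF font, i.e. one carrying a ROS entry (hypothetical type). */
    static class RosCffFont extends SimpleCffFont { }

    /** Codes may be treated as CIDs only when the font is a ROS font. */
    static boolean usesCids(SimpleCffFont font) {
        return font instanceof RosCffFont;
    }

    public static void main(String[] args) {
        System.out.println(usesCids(new SimpleCffFont())); // prints "false"
        System.out.println(usesCids(new RosCffFont()));    // prints "true"
    }
}
```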

      Problem 2

      Look at line 3: "of right shoice" should be "of right choice".
      Likewise, on line 2 of the 2nd paragraph, "And a staunsh" should be "And a staunch".
      The st and ch ligatures are incorrect.

      This is because the font is a CFF ROS CID font and the glyphs for the st and ch ligatures
      both have no name. The CFF format achieves this by using SIDs beyond the size of the string
      index, which map to .notdef. So there is a unique SID for each glyph, but not a unique name.

      Unfortunately, PDFBox assumes that Type 1 fonts have glyphs with unique names, and this
      assumption appears throughout the codebase. Because a glyph name and a SID perform essentially
      the same role, I recommend a simple solution to the problem: when a SID beyond the size of
      the string index is encountered, instead of mapping it to .notdef it should be mapped to
      a new name with the prefix "SID", for example mapping SID 409 to the name "SID409". That way
      each glyph will have a unique name, which is what PDFBox assumes.

      Patch 2 fixes this.
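
      The proposed fallback can be sketched roughly as follows. The method nameForSid and the
      single combined string table are assumptions made for illustration (a real CFF reader
      resolves standard strings and the font's string index separately); this is not the
      code from the patch.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: a hypothetical helper, not PDFBox's actual API. */
class SidNameSketch {
    /**
     * Resolves a SID to a glyph name. SIDs inside the (simplified, combined)
     * string table return the stored name; SIDs beyond it get a synthetic,
     * unique name of the form "SID<n>" instead of ".notdef".
     */
    static String nameForSid(int sid, List<String> strings) {
        if (sid >= 0 && sid < strings.size()) {
            return strings.get(sid);
        }
        return "SID" + sid;
    }

    public static void main(String[] args) {
        List<String> strings = Arrays.asList(".notdef", "space", "exclam");
        System.out.println(nameForSid(1, strings));   // prints "space"
        System.out.println(nameForSid(409, strings)); // prints "SID409"
        System.out.println(nameForSid(410, strings)); // prints "SID410"
    }
}
```

      Two unnamed glyphs with different SIDs therefore end up with different names, which
      is the property the rest of the codebase relies on.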

      Problem 3

      Look at line 2, "That creepeth oÉer ruins old!": the word "o'er" is incorrectly rendered
      as "oÉer". This is because the Encoding entry in the PDF maps code 201 from "Eacute" in the
      base encoding to "quoteright", but this is being ignored by PDFBox.

      In the CFFGlyph2D constructor PDFBox examines the font's built-in charset. When the name
      "quoteright" is encountered it is looked up in the PDF Encoding (i.e. nameToCode) where
      it is changed to code 201. Thus code 201 is associated with the "quoteright" glyph in the
      codeToGlyph map. This is correct.

      However, later when the "Eacute" glyph is encountered, its built-in charset code is also
      201 (which is standard) and so the codeToGlyph map entry is overwritten, resulting in
      code 201 being associated with the "Eacute" glyph.

      The solution is to build the codeToGlyph map in a strict order: first populate it with the
      font's built-in charset, then the PDF Encoding overwrites any entries which it defines.

      Patch 3 fixes this (and also replaces patch 2).
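
      The strict ordering described above can be sketched as below. The build method, the use
      of glyph names as map values, and the sample entry for code 65 are assumptions for
      illustration; the real codeToGlyph map in CFFGlyph2D holds glyph data, and this is
      not the code from the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: a simplified model of building the codeToGlyph map. */
class CodeToGlyphSketch {
    /**
     * Step 1: populate the map from the font's built-in charset.
     * Step 2: apply the PDF Encoding last, so its entries always win.
     */
    static Map<Integer, String> build(Map<Integer, String> builtIn,
                                      Map<Integer, String> pdfEncoding) {
        Map<Integer, String> codeToGlyph = new HashMap<>(builtIn);
        codeToGlyph.putAll(pdfEncoding); // overwrites only the codes it defines
        return codeToGlyph;
    }

    public static void main(String[] args) {
        Map<Integer, String> builtIn = new HashMap<>();
        builtIn.put(65, "A");       // hypothetical sample entry
        builtIn.put(201, "Eacute"); // built-in code from the report

        Map<Integer, String> pdfEncoding = new HashMap<>();
        pdfEncoding.put(201, "quoteright"); // override from the PDF Encoding entry

        Map<Integer, String> codeToGlyph = build(builtIn, pdfEncoding);
        System.out.println(codeToGlyph.get(201)); // prints "quoteright"
        System.out.println(codeToGlyph.get(65));  // prints "A"
    }
}
```

      Because the override is applied in a separate second pass, the result no longer depends
      on the order in which glyphs appear in the font's charset.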


        1. bimbo_historia-patched.jpg
          382 kB
          John Hewson
        2. bimbo_historia.patch
          3 kB
          John Hewson
        3. Bimbo_Historia_20070409_Esp.pdf-2-rev-current.png
          933 kB
          Tilman Hausherr
        4. Bimbo_Historia_20070409_Esp.pdf-2-rev-1554775.png
          1.01 MB
          Tilman Hausherr
        5. patched.jpg
          219 kB
          John Hewson
        6. trunk.jpg
          268 kB
          John Hewson
        7. calluna-11.pdf
          146 kB
          John Hewson
        8. all.patch
          3 kB
          John Hewson
        9. 3.patch
          2 kB
          John Hewson
        10. 2.patch
          0.5 kB
          John Hewson
        11. 1.patch
          1 kB
          John Hewson


              • Assignee:
                lehmi Andreas Lehmkühler
                jahewson John Hewson
              • Votes:
                0
              • Watchers:
                3


                • Created: