FOP-230

Text with embedded CID fonts not retrievable from pdf

    Details

    • Type: Bug
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: 0.15
    • Fix Version/s: None
    • Component/s: pdf
    • Labels:
      None
    • Environment:
      Operating System: All
      Platform: PC
    • External issue ID:
      5335

      Description

      When I embed TrueType fonts as CID fonts in a PDF file with FOP 0.2.0, the
      characters appear fine in the PDF, but copying-and-pasting them from the PDF
      with Acrobat Reader's (5.0) text selection tool does not work. All characters
      are reduced to spaces.
      Is this an issue of how the font information gets written into the PDF?

      1. examples.tar.gz
        113 kB
        Satoshi Ishigami
      2. fop-0.20.5-toUnicodeCMap.patch
        22 kB
        Adam Strzelecki
      3. fop-0.90-trunk-toUnicodeCMap.patch
        24 kB
        Adam Strzelecki
      4. toUnicodeCMap.patch
        6 kB
        Vincent Hennebert

          Activity

          Glenn Adams added a comment -

          batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed

          Bertrand Delacretaz added a comment -

          FOP-1238 has been marked as a duplicate of this bug.

          Bertrand Delacretaz added a comment -

          I think this issue can be closed now, please reopen if I'm wrong

          Bertrand Delacretaz added a comment -

          Patch of comment #17 applied in revision 462741, thanks!

          Vincent Hennebert added a comment -

          Attachment toUnicodeCMap.patch has been added with description: patch for PDFToUnicodeCMap.java, correcting bugs and simplifying the code

          Vincent Hennebert added a comment -

          Here's a patch that improves PDFToUnicodeCMap.java. Changelog:

          • bugfix: the last range in the bfrange entries was ending at -1, translated to
            FFFFFFFF in the PDF file. That's what was causing the error "Illegal entry in
            bfrange block in ToUnicode CMap"
          • bugfix: if there were more than 100 ranges, the beginbfrange...endbfrange
            sections would have been wrongly generated
          • some code cleanup and simplification

          Vincent
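
The two bfrange fixes in the changelog above can be illustrated with a small sketch. This is not the code from toUnicodeCMap.patch (which targets PDFToUnicodeCMap.java); the Python function names `to_bfranges` and `bfrange_sections` are hypothetical, and 16-bit glyph codes with BMP code points are assumed. Each range is closed at the last glyph actually mapped, so no -1 (FFFFFFFF) sentinel can reach the output, and a new beginbfrange...endbfrange section is started every 100 entries.

```python
def to_bfranges(glyph_to_unicode):
    """Group a {glyph: code point} dict into (first_glyph, last_glyph, first_cp) ranges."""
    ranges = []
    for glyph, cp in sorted(glyph_to_unicode.items()):
        if ranges:
            first, last, start_cp = ranges[-1]
            span = glyph - first
            # Extend the open range only if glyph and code point both continue
            # the run, and the low byte of the destination does not pass 255
            # (a constraint the PDF Reference places on bfrange destinations).
            if (glyph == last + 1
                    and cp == start_cp + span
                    and (start_cp & 0xFF) + span <= 0xFF):
                ranges[-1] = (first, glyph, start_cp)
                continue
        # Otherwise start a new range that begins and ends at this glyph.
        ranges.append((glyph, glyph, cp))
    return ranges


def bfrange_sections(ranges, limit=100):
    """Emit the ranges as bfrange sections of at most `limit` entries each."""
    lines = []
    for start in range(0, len(ranges), limit):
        chunk = ranges[start:start + limit]
        lines.append(f"{len(chunk)} beginbfrange")
        for first, last, cp in chunk:
            lines.append(f"<{first:04X}> <{last:04X}> <{cp:04X}>")
        lines.append("endbfrange")
    return lines
```

Because every range starts as `(glyph, glyph, cp)` and is only ever extended to a glyph that exists in the mapping, the final range never needs a separate "close" step, which is where an end-of-input sentinel could otherwise slip in.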

          Bertrand Delacretaz added a comment -

          The correct revision is 454731, I forgot a file in the previous commit.

          Bertrand Delacretaz added a comment -

          I have applied the patch from comment #10 in revision 454725.

          Tested with several embedded TTF fonts, and OpenType fonts with TTF outlines,
          using an XSL-FO document containing accented characters:

          - Copy and paste from the PDF file works.
          - pdftotext extracts text correctly.

          Vincent told me about an "Illegal entry in bfrange block in ToUnicode CMap"
          error when opening a generated PDF file with xpdf, but I haven't been able to
          reproduce it yet.

          FWIW, here's the output of pdffonts on one of my test files:

          name                 type          emb sub uni object ID
          -------------------- ------------- --- --- --- ---------
          Helvetica-Bold       Type 1        no  no  no      19  0
          3E5537MSGothic       CID TrueType  yes no  yes     16  0
          Times-Roman          Type 1        no  no  no      20  0
          4E5583PMingLiU       CID TrueType  yes no  yes     24  0
          2E54a8Gulim          CID TrueType  yes no  yes     30  0
          1E548dRockwell       CID TrueType  yes no  yes     36  0

          Bertrand Delacretaz added a comment -

          Problem can be reproduced by generating the barcode.fo example, see the comments
          about fo.example properties inside examples/fo/advanced/barcode.fo for how to do
          this.

          Currently, the barcode text ("123456") is not found when searching the
          generated PDF in Acrobat Reader.

          Victor Mote added a comment -

          (In reply to comment #12)
          > The attachment of comment #10 apparently contains code coming from FOray, see
          > FOP-1238

          Recent updates to this thread were brought to my attention for the purpose of
          obtaining permission for FOP to use the mentioned FOray code. FOP may freely use
          the code mentioned or any other FOray code as it wishes, and FOP may consider it
          contributed by FOray.

          It pains me a bit that we are doing the cut-and-paste thing, but this is perhaps
          my fault for being so slow to release FOray 0.2. I did an aXSL release last week
          and hope to do a FOray release within the next two weeks, not because FOray as a
          whole is ready to release, but I think the font code is.

          Bertrand Delacretaz added a comment -

          The attachment of comment #10 apparently contains code coming from FOray, see
          FOP-1238

          Jeremias Maerki added a comment -

          FOP-1213 has been marked as a duplicate of this bug.

          Adam Strzelecki added a comment -

          Attachment fop-0.90-trunk-toUnicodeCMap.patch has been added with description: ToUnicode generation for FOP 0.90 TRUNK

          Adam Strzelecki added a comment -

          This is the ToUnicode patch for the 0.90 trunk. It can be included before the
          FOray changes arrive, so that users can copy and paste.

          Adam Strzelecki added a comment -

          Attachment fop-0.20.5-toUnicodeCMap.patch has been added with description: FOP 0.20.5 generates ToUnicode maps for CID embedded TTF fonts

          Adam Strzelecki added a comment -

          This is a patch that makes FOP 0.20.5 generate ToUnicode maps for CID embedded
          TTF fonts. The patch uses some code from FOray.
          With it you can finally cut / copy text from PDFs with embedded TTF CID
          fonts generated by FOP, so you don't need -enc ansi fonts anymore.

          Victor Mote added a comment -

          This issue has been fixed in FOray CVS, and should be available in FOray 0.2:
          http://www.foray.org/release.html.

          Jeremias Maerki added a comment -

          FOP-877 has been marked as a duplicate of this bug.

          Jeremias Maerki added a comment -

          You sound like you're up to the task of doing a fix for this. Wanna try? Please
          realize that we're already short on resources and you shouldn't count on
          anybody fixing this soon enough for you. Open source sometimes involves doing
          something yourself and sending in a patch for badly needed functionality. Thanks
          for understanding.

          Gracjan Polak added a comment -

          Sample ToUnicode CMap (with explanations below):

          --------
          /CIDInit /ProcSet findresource begin
          12 dict begin
          begincmap
          /CIDSystemInfo <<
          /Registry (UniqueName) /Ordering (FOP) /Supplement 0 >> def
          /CMapName /UniqueName def
          /CMapType 2 def
          1 begincodespacerange
          <0000> <ffff>
          endcodespacerange
          6 beginbfchar
          <0005> <0041>
          <0006> <0042>
          <0007> <0043>
          <0008> <0044>
          <0009> <0045>
          <000A> <0046>
          endbfchar
          endcmap
          CMapName currentdict /CMap defineresource pop
          end end
          ---------

          This looks like a normal CMap. I'll point out some differences:

          /Registry and /CMapName must be unique and equal, so make up a name.
          The name of the font for which the ToUnicode CMap was generated is fine here:
          just copy the /Name value from the /Font dictionary.

          CMapType for ToUnicode CMaps is 2.

          begincodespacerange must be as shown above.

          beginbfchar section has format:
          <glyph> <unicode>

          There can be several beginbfchar sections; each can have no more than 100 entries.
          The section in the sample says that glyph 5 is A, glyph 6 is B, and so on.

          There are also beginbfrange sections, but I don't think they will be of any use here.

          It is better not to emit structural comments (%%) or /XUID or /UIDOffset or all
          the other crap. They are useless here and misleading.

          All above taken from:
          http://partners.adobe.com:80/asn/developer/pdfs/tn/5411.ToUnicode.pdf
          and checked by hand. It works at least with Acrobat Reader 5.0.

          Do I look like I wanted this feature badly? Mail me for more info.
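
The layout described in this comment can be generated mechanically. The following is a minimal sketch, not FOP's implementation: the function name `build_to_unicode_cmap` and the default name `FOPToUnicode` are hypothetical, and it assumes 16-bit glyph codes mapped to code points in the Basic Multilingual Plane. It reproduces the header from the sample and splits the mappings into beginbfchar sections of at most 100 entries.

```python
def build_to_unicode_cmap(glyph_to_unicode, name="FOPToUnicode"):
    """Return ToUnicode CMap text for a {glyph code: Unicode code point} dict."""
    entries = sorted(glyph_to_unicode.items())
    for glyph, cp in entries:
        # Assumption of this sketch: both sides fit in four hex digits.
        if not (0 <= glyph <= 0xFFFF and 0 <= cp <= 0xFFFF):
            raise ValueError(f"out of range: glyph {glyph}, code point {cp}")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo <<",
        # /Registry and /CMapName use the same made-up unique name.
        f"/Registry ({name}) /Ordering (FOP) /Supplement 0 >> def",
        f"/CMapName /{name} def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <ffff>",
        "endcodespacerange",
    ]
    # Each bfchar section may hold at most 100 <glyph> <unicode> pairs.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for glyph, cp in chunk:
            lines.append(f"<{glyph:04X}> <{cp:04X}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end end",
    ]
    return "\n".join(lines) + "\n"
```

Calling `build_to_unicode_cmap({5: 0x41, 6: 0x42, 7: 0x43, 8: 0x44, 9: 0x45, 10: 0x46}, name="UniqueName")` yields the same header and six bfchar entries as the sample above.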

          Oleg Tkachenko added a comment -

          See http://marc.theaimsgroup.com/?l=fop-dev&m=103839237726705&w=2

          Satoshi Ishigami added a comment -

          Attachment examples.tar.gz has been added with description: Japanese sample pdf and fo files generated by FOP

          Satoshi Ishigami added a comment -

          I think this is a FOP problem caused by a wrong CMap.

          I can copy and paste the Japanese text in
          http://www.morisawa.co.jp/font/info/pdf/AbtNewCID.pdf
          because that PDF file is correctly generated.

          FOP uses the value 'UCS' in the CIDSystemInfo of the PDF,
          and it assigns its own sequential value to each glyph.

          When you copy and paste text, the PDF reader tries to convert it
          to the correct character encoding (for example, Adobe-Japan1),
          but there is no mapping information to do so.

          For copy and paste to work, FOP must create the PDF using
          the correct Encoding and CIDSystemInfo.

          For details on this kind of problem, see
          http://marc.theaimsgroup.com/?l=fop-dev&m=101408636328343&w=2

          Oleg Tkachenko added a comment -

          I'm not sure it's a FOP problem. I just tested a random PDF with Japanese
          text - http://www.morisawa.co.jp/font/info/pdf/AbtNewCID.pdf - and it behaves
          the same way: I cannot copy'n'paste the Japanese text. I'll close the bug, but if you
          think I'm wrong, feel free to reopen it and provide an example, please.


            People

            • Assignee:
              fop-dev
              Reporter:
              Lukas Pietsch