When I embed TrueType fonts as CID fonts in a PDF file with FOP 0.2.0, the characters appear fine in the PDF, but copying and pasting them with the text selection tool in Acrobat Reader 5.0 does not work: all characters are reduced to spaces. Is this caused by how the font information gets written into the PDF?
I'm not sure this is FOP's problem. I just tested an arbitrary PDF with Japanese text, http://www.morisawa.co.jp/font/info/pdf/AbtNewCID.pdf , and it behaves the same way: I cannot copy and paste the Japanese text. I'll close the bug, but if you think I'm wrong, feel free to reopen it and please provide an example.
I think this is FOP's problem, caused by a wrong CMap. I can copy and paste the Japanese text from http://www.morisawa.co.jp/font/info/pdf/AbtNewCID.pdf , because that PDF file is correctly generated. FOP uses 'UCS' as the CIDSystemInfo value in the PDF and assigns its own sequential value to each glyph. When you copy and paste text, the PDF reader tries to convert it to the correct character encoding (for example, Adobe-Japan1), but there is no information to map from. For copy and paste to work, FOP must create the PDF with the correct Encoding and CIDSystemInfo. For details on this kind of problem, see http://marc.theaimsgroup.com/?l=fop-dev&m=101408636328343&w=2
Created attachment 3939 [details] Japanese sample pdf and fo files generated by FOP
See http://marc.theaimsgroup.com/?l=fop-dev&m=103839237726705&w=2
Sample ToUnicode CMap (with explanations below):
--------
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (UniqueName)
/Ordering (FOP)
/Supplement 0
>> def
/CMapName /UniqueName def
/CMapType 2 def
1 begincodespacerange
<0000> <ffff>
endcodespacerange
6 beginbfchar
<0005> <0041>
<0006> <0042>
<0007> <0043>
<0008> <0044>
<0009> <0045>
<000A> <0046>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
---------
It looks like a normal CMap. I'll point out some differences:
- /Registry and /CMapName must be unique and equal, so make up some names. The name of the font for which the ToUnicode CMap was generated is fine here; just copy the /Name value from the /Font dictionary.
- CMapType for ToUnicode CMaps is 2.
- begincodespacerange must be as shown above.
- A beginbfchar section has the format: <glyph> <unicode>. There can be several beginbfchar sections, each with no more than 100 entries. The section in the sample says that glyph 5 is A, glyph 6 is B, and so on.
- There are also beginbfrange sections, but I don't think they will be of any use here.
- It is better not to emit structural comments (%%) or /XUID or /UIDOffset or all the other crap. They are useless here and misleading.
All of the above is taken from http://partners.adobe.com:80/asn/developer/pdfs/tn/5411.ToUnicode.pdf and checked by hand. It works at least with Acrobat Reader 5.0.
Do I look like I wanted this feature badly? :-) Mail me for more info.
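For anyone who wants to experiment, the rules above can be sketched in a few lines of Python. This is only an illustration, not FOP or FOray code: the function name build_tounicode_cmap, its parameters, and the restriction to 2-byte glyph codes and BMP code points are all assumptions made for the example.

```python
from typing import Dict, List

MAX_ENTRIES = 100  # per-section limit described in the comment above


def build_tounicode_cmap(glyph_to_unicode: Dict[int, int],
                         name: str = "UniqueName") -> str:
    """Return ToUnicode CMap text for a glyph-code -> code-point mapping.

    Hypothetical helper: handles only 2-byte glyph codes and BMP code
    points (anything above U+FFFF would need UTF-16 surrogate pairs).
    """
    for glyph, code_point in glyph_to_unicode.items():
        if not (0 <= glyph <= 0xFFFF and 0 <= code_point <= 0xFFFF):
            raise ValueError("only 2-byte glyph codes and BMP code points")

    pairs = sorted(glyph_to_unicode.items())
    lines: List[str] = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo",
        f"<< /Registry ({name})",
        "/Ordering (FOP)",
        "/Supplement 0",
        ">> def",
        f"/CMapName /{name} def",  # same made-up name as /Registry
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <ffff>",
        "endcodespacerange",
    ]
    # Emit the mapping in bfchar sections of at most MAX_ENTRIES entries.
    for start in range(0, len(pairs), MAX_ENTRIES):
        chunk = pairs[start:start + MAX_ENTRIES]
        lines.append(f"{len(chunk)} beginbfchar")
        for glyph, code_point in chunk:
            lines.append(f"<{glyph:04X}> <{code_point:04X}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Same mapping as the sample: glyph 5 is A, glyph 6 is B, ... glyph 10 is F.
sample = {5 + i: ord("A") + i for i in range(6)}
cmap_text = build_tounicode_cmap(sample)
```

Printing cmap_text should give a body equivalent to the hand-written sample; a mapping with more than 100 glyphs is simply split across several beginbfchar sections.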
You sound like you're up to the task of doing a fix for this. Wanna try? Please realize that we're already short on resources, and you shouldn't count on anybody fixing this soon enough for you. Open source sometimes involves doing something yourself and sending in a patch for badly needed functionality. Thanks for understanding.
*** Bug 28705 has been marked as a duplicate of this bug. ***
This issue has been fixed in FOray CVS, and should be available in FOray 0.2: http://www.foray.org/release.html.
Created attachment 17104 [details]
FOP 0.20.5 generates ToUnicode maps for CID embedded TTF fonts

This is a patch that makes FOP 0.20.5 generate ToUnicode maps for CID embedded TTF fonts. The patch uses some code from FOray. With it you can finally cut/copy text from PDFs generated by FOP with embedded TTF CID fonts, so you no longer need -enc ansi fonts.
Created attachment 17203 [details]
ToUnicode generation for FOP 0.90 TRUNK

This is the version of the ToUnicode patch for 0.90 TRUNK. It can be included before the FOray changes arrive, so that users can have copy and paste.
*** Bug 40081 has been marked as a duplicate of this bug. ***
The attachment of comment #10 apparently contains code coming from FOray, see bug #40467
(In reply to comment #12)
> The attachment of comment #10 apparently contains code coming from FOray, see
> bug #40467

Recent updates to this thread were brought to my attention for the purpose of obtaining permission for FOP to use the mentioned FOray code. FOP may freely use the code mentioned or any other FOray code as it wishes, and FOP may consider it contributed by FOray.

It pains me a bit that we are doing the cut-and-paste thing, but this is perhaps my fault for being so slow to release FOray 0.2. I did an aXSL release last week and hope to do a FOray release within the next two weeks, not because FOray as a whole is ready to release, but I think the font code is.
The problem can be reproduced by generating the barcode.fo example; see the comments about fo.example properties inside examples/fo/advanced/barcode.fo for how to do this. Currently, searching the generated PDF in Acrobat Reader does not find the barcode text ("123456").
I have applied the patch from comment #10 in revision 454725.

Tested with several embedded TTF fonts, and OpenType fonts with TTF outlines, using an XSL-FO document containing accented characters:
- Copy and paste from the PDF file works.
- pdftotext extracts text correctly.

Vincent told me about an "Illegal entry in bfrange block in ToUnicode CMap" error when opening a generated PDF file with xpdf, but I haven't been able to reproduce it yet.

FWIW, here's the output of pdffonts on one of my test files:

name                 type          emb sub uni object ID
-------------------- ------------- --- --- --- ---------
Helvetica-Bold       Type 1        no  no  no      19  0
3E5537MSGothic       CID TrueType  yes no  yes     16  0
Times-Roman          Type 1        no  no  no      20  0
4E5583PMingLiU       CID TrueType  yes no  yes     24  0
2E54a8Gulim          CID TrueType  yes no  yes     30  0
1E548dRockwell       CID TrueType  yes no  yes     36  0
The correct revision is 454731; I forgot a file in the previous commit.
Created attachment 18987 [details]
patch for PDFToUnicodeCMap.java, correcting bugs and simplifying the code

Here's a patch that improves PDFToUnicodeCMap.java. Changelog:
- bugfix: the last range in the bfrange entries was ending at -1, translated to FFFFFFFF in the PDF file. That's what was causing the error "Illegal entry in bfrange block in ToUnicode CMap"
- bugfix: if there were more than 100 ranges, the beginbfrange...endbfrange sections would have been wrongly generated
- some code cleanup and simplification

Vincent
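To illustrate the two bugfixes for readers who don't want to dig through the patch, here is a rough Python sketch of the idea. It is not the code from PDFToUnicodeCMap.java: the names to_ranges and bfrange_sections are made up for this example, and keeping each range within a single high byte on both sides is a conservative assumption of the sketch.

```python
from typing import Dict, List, Tuple

MAX_ENTRIES = 100  # maximum entries per beginbfrange...endbfrange section

Range = Tuple[int, int, int]  # (first glyph, last glyph, first code point)


def to_ranges(glyph_to_unicode: Dict[int, int]) -> List[Range]:
    """Collapse runs of consecutive glyphs mapping to consecutive code points."""
    ranges: List[Range] = []
    for glyph, cp in sorted(glyph_to_unicode.items()):
        if ranges:
            first, last, start_cp = ranges[-1]
            extends = (
                glyph == last + 1
                and cp == start_cp + (glyph - first)
                and (glyph >> 8) == (first >> 8)   # stay within one high byte
                and (cp >> 8) == (start_cp >> 8)   # same for the target side
            )
            if extends:
                # Every range, including the final one, always ends at a
                # glyph actually seen, so no -1 sentinel can leak out.
                ranges[-1] = (first, glyph, start_cp)
                continue
        ranges.append((glyph, glyph, cp))
    return ranges


def bfrange_sections(ranges: List[Range]) -> str:
    """Write the ranges as bfrange sections of at most MAX_ENTRIES each."""
    lines: List[str] = []
    for start in range(0, len(ranges), MAX_ENTRIES):
        chunk = ranges[start:start + MAX_ENTRIES]
        lines.append(f"{len(chunk)} beginbfrange")  # count of this chunk only
        for first, last, cp in chunk:
            lines.append(f"<{first:04X}> <{last:04X}> <{cp:04X}>")
        lines.append("endbfrange")
    return "\n".join(lines)


demo = bfrange_sections(to_ranges({5: 0x41, 6: 0x42, 7: 0x43, 10: 0x61}))
```

Here glyphs 5 to 7 collapse into the single entry <0005> <0007> <0041>, and glyph 10 gets a one-glyph range of its own.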
Patch of comment #17 applied in revision 462741, thanks!
I think this issue can be closed now; please reopen it if I'm wrong.
*** Bug 40467 has been marked as a duplicate of this bug. ***
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed