Bug 5335 - Text with embedded CID fonts not retrievable from pdf
Summary: Text with embedded CID fonts not retrievable from pdf
Status: CLOSED FIXED
Alias: None
Product: Fop - Now in Jira
Classification: Unclassified
Component: pdf
Version: 0.15
Hardware: PC All
Importance: P3 minor
Target Milestone: ---
Assignee: fop-dev
URL:
Keywords:
Duplicates: 28705 40081 40467
Depends on:
Blocks:
 
Reported: 2001-12-10 01:08 UTC by Lukas Pietsch
Modified: 2012-04-01 07:03 UTC (History)

Attachments
Japanese sample pdf and fo files generated by FOP (112.69 KB, application/octet-stream)
2002-11-25 00:35 UTC, Satoshi Ishigami
FOP 0.20.5 generates ToUnicode maps for CID embedded TTF fonts (21.68 KB, patch)
2005-12-01 15:53 UTC, Adam Strzelecki
ToUnicode generation for FOP 0.90 TRUNK (24.48 KB, patch)
2005-12-12 19:27 UTC, Adam Strzelecki
patch for PDFToUnicodeCMap.java, correcting bugs and simplifying the code (5.93 KB, patch)
2006-10-11 01:30 UTC, Vincent Hennebert

Description Lukas Pietsch 2001-12-10 01:08:52 UTC
When I embed TrueType fonts as CID fonts in a PDF file with FOP 0.2.0, the
characters appear fine in the PDF, but copying and pasting them from the PDF
with the text selection tool of Acrobat Reader 5.0 does not work. All characters
are reduced to spaces.
Is this an issue of how the font information gets written into the PDF?
Comment 1 Oleg Tkachenko 2002-11-24 16:07:11 UTC
I'm not sure this is a FOP problem. I just tested an arbitrary PDF with Japanese
text, http://www.morisawa.co.jp/font/info/pdf/AbtNewCID.pdf, and it behaves
the same way: I cannot copy and paste the Japanese text. I'll close the bug, but
if you think I'm wrong, feel free to reopen it and please provide an example.
Comment 2 Satoshi Ishigami 2002-11-25 00:33:50 UTC
I think this is a FOP problem caused by a wrong CMap.

I can copy and paste Japanese text from
http://www.morisawa.co.jp/font/info/pdf/AbtNewCID.pdf
because that PDF file was generated correctly.

FOP uses the value 'UCS' in the CIDSystemInfo of the PDF,
and it assigns its own sequential value to each glyph.

When you copy and paste text, the PDF reader tries to convert it
to the correct character encoding (for example, Adobe-Japan1),
but there is no information available for the mapping.

To make copy and paste work, FOP must create the PDF with the
correct Encoding and CIDSystemInfo.

For details on this kind of problem, see
http://marc.theaimsgroup.com/?l=fop-dev&m=101408636328343&w=2
Comment 3 Satoshi Ishigami 2002-11-25 00:35:08 UTC
Created attachment 3939 [details]
Japanese sample pdf and fo files generated by FOP
Comment 5 Gracjan Polak 2003-01-23 14:46:15 UTC
Sample ToUnicode CMap (with explanations below):

--------
/CIDInit /ProcSet findresource begin 
12 dict begin 
begincmap 
/CIDSystemInfo <<
/Registry (UniqueName) /Ordering (FOP) /Supplement 0 >> def
/CMapName /UniqueName def
/CMapType 2 def
1 begincodespacerange 
<0000> <ffff> 
endcodespacerange
6 beginbfchar
<0005> <0041>
<0006> <0042>
<0007> <0043>
<0008> <0044>
<0009> <0045>
<000A> <0046>
endbfchar
endcmap 
CMapName currentdict /CMap defineresource pop 
end end
---------

It looks like a normal CMap. I'll point out some differences:

/Registry and /CMapName must be unique and equal, so make up some names.
The name of the font for which the ToUnicode CMap was generated is fine here;
just copy the /Name value from the /Font dictionary.

CMapType for ToUnicode CMaps is 2.

begincodespacerange must be as shown above.

beginbfchar section has format:
<glyph> <unicode>

There can be more than one beginbfchar section; each can have no more than
100 entries. The section in the sample says that glyph 5 is A, glyph 6 is B,
and so on.
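
As a rough illustration of the two points above (the <glyph> <unicode> entry format and the 100-entry limit per section), here is a minimal Java sketch. The class and method names are hypothetical and are not FOP's actual code; it also assumes every code point fits in four hex digits (BMP only).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of FOP.
class BfcharSketch {
    // Per-section entry limit described in the comment above.
    static final int MAX_ENTRIES = 100;

    /** Renders glyph index -> Unicode pairs as one or more beginbfchar sections. */
    static String bfcharSections(Map<Integer, Integer> glyphToUnicode) {
        // Sort by glyph index so the output order is deterministic.
        TreeMap<Integer, Integer> sorted = new TreeMap<>(glyphToUnicode);
        List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(sorted.entrySet());
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < entries.size(); start += MAX_ENTRIES) {
            int end = Math.min(start + MAX_ENTRIES, entries.size());
            // Each section is prefixed with the number of entries it holds.
            sb.append(end - start).append(" beginbfchar\n");
            for (Map.Entry<Integer, Integer> e : entries.subList(start, end)) {
                // Assumption: BMP code points only, so four hex digits suffice.
                sb.append(String.format("<%04X> <%04X>\n", e.getKey(), e.getValue()));
            }
            sb.append("endbfchar\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> sample = new TreeMap<>();
        for (int i = 0; i < 6; i++) {
            sample.put(5 + i, 0x41 + i); // glyph 5 is A, glyph 6 is B, ...
        }
        System.out.print(bfcharSections(sample));
    }
}
```

With more than 100 mappings, the loop simply starts a new section for each further block of up to 100 entries.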

There are also beginbfrange sections, but I don't think they will be of any use here.

It is better not to emit structural comments (%%), /XUID, /UIDOffset, or all
the other crap. They are useless here and misleading.

All of the above is taken from:
http://partners.adobe.com:80/asn/developer/pdfs/tn/5411.ToUnicode.pdf
and checked by hand. It works at least with Acrobat Reader 5.0.

Do I look like I wanted this feature badly?:-) Mail me for more info.
Comment 6 Jeremias Maerki 2003-01-23 14:57:08 UTC
You sound like you're up to the task of doing a fix for this. Wanna try? Please
realize that we're already short on resources and you shouldn't count on
anybody fixing this soon enough for you. Open source sometimes involves doing
something yourself and sending in a patch for badly needed functionality. Thanks
for understanding.
Comment 7 Jeremias Maerki 2004-09-23 10:18:41 UTC
*** Bug 28705 has been marked as a duplicate of this bug. ***
Comment 8 Victor Mote 2004-09-23 13:52:14 UTC
This issue has been fixed in FOray CVS, and should be available in FOray 0.2: 
http://www.foray.org/release.html.
Comment 9 Adam Strzelecki 2005-12-01 15:53:51 UTC
Created attachment 17104 [details]
FOP 0.20.5 generates ToUnicode maps for CID embedded TTF fonts

This is a patch that makes FOP 0.20.5 generate ToUnicode maps for CID embedded
TTF fonts. The patch uses some code from FOray.
With this patch you can finally cut and copy text from PDFs with embedded TTF CID
fonts generated by FOP, so you don't need -enc ansi fonts anymore.
Comment 10 Adam Strzelecki 2005-12-12 19:27:25 UTC
Created attachment 17203 [details]
ToUnicode generation for FOP 0.90 TRUNK

This is the version of the ToUnicode patch for 0.90 TRUNK. It can be included
before the FOray changes arrive so that users can have copy and paste.
Comment 11 Jeremias Maerki 2006-08-02 19:30:45 UTC
*** Bug 40081 has been marked as a duplicate of this bug. ***
Comment 12 Bertrand Delacretaz 2006-09-11 15:32:15 UTC
The attachment of comment #10 apparently contains code coming from FOray, see
bug #40467
Comment 13 Victor Mote 2006-09-11 16:17:15 UTC
(In reply to comment #12)
> The attachment of comment #10 apparently contains code coming from FOray, see
> bug #40467

Recent updates to this thread were brought to my attention for the purpose of
obtaining permission for FOP to use the mentioned FOray code. FOP may freely use
the code mentioned or any other FOray code as it wishes, and FOP may consider it
contributed by FOray.

It pains me a bit that we are doing the cut-and-paste thing, but this is perhaps
my fault for being so slow to release FOray 0.2. I did an aXSL release last week
and hope to do a FOray release within the next two weeks, not because FOray as a
whole is ready to release, but I think the font code is.
Comment 14 Bertrand Delacretaz 2006-09-21 13:05:02 UTC
Problem can be reproduced by generating the barcode.fo example, see the comments
about fo.example properties inside examples/fo/advanced/barcode.fo for how to do
this.

Currently, the barcode text ("123456") is not found when searching the
generated PDF in Acrobat Reader.
Comment 15 Bertrand Delacretaz 2006-10-10 06:10:21 UTC
I have applied the patch from comment #10 in revision 454725.

Tested with several embedded TTF fonts, and OpenType fonts with TTF outlines,
using an XSL-FO document containing accented characters:

-Copy and paste from the PDF file works.
-pdftotext extracts text correctly.

Vincent told me about an "Illegal entry in bfrange block in ToUnicode CMap"
error when opening a generated PDF file with xpdf, but I haven't been able to
reproduce it yet.

FWIW, here's the output of pdffonts on one of my test files (columns: name,
type, emb, sub, uni, object ID):

Helvetica-Bold                       Type 1       no  no  no      19  0
3E5537MSGothic                       CID TrueType yes no  yes     16  0
Times-Roman                          Type 1       no  no  no      20  0
4E5583PMingLiU                       CID TrueType yes no  yes     24  0
2E54a8Gulim                          CID TrueType yes no  yes     30  0
1E548dRockwell                       CID TrueType yes no  yes     36  0
Comment 16 Bertrand Delacretaz 2006-10-10 06:15:18 UTC
The correct revision is 454731, I forgot a file in the previous commit.
Comment 17 Vincent Hennebert 2006-10-11 01:30:43 UTC
Created attachment 18987 [details]
patch for PDFToUnicodeCMap.java, correcting bugs and simplifying the code

Here's a patch that improves PDFToUnicodeCMap.java. Changelog:
- bugfix: the last range in the bfrange entries was ending at -1, translated to
FFFFFFFF in the PDF file. That's what was causing the error "Illegal entry in
bfrange block in ToUnicode CMap"
- bugfix: if there were more than 100 ranges, the beginbfrange...endbfrange
sections would have been wrongly generated
- some code cleanup and simplification

Vincent
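
The two bugfixes in the changelog above (a final range that ended at -1, and broken output with more than 100 ranges) can be illustrated with a minimal Java sketch. This is not the code from the patch: the class and method names are hypothetical, code points are assumed to fit in four hex digits, and the rule of splitting a range whenever the low byte wraps to 00 is a conservative assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not the patched PDFToUnicodeCMap.
class BfrangeSketch {
    static final int MAX_ENTRIES = 100; // per-section limit

    /** Groups consecutive mappings into {firstGlyph, lastGlyph, firstUnicode} ranges. */
    static List<int[]> toRanges(Map<Integer, Integer> glyphToUnicode) {
        TreeMap<Integer, Integer> sorted = new TreeMap<>(glyphToUnicode);
        List<int[]> ranges = new ArrayList<>();
        int[] cur = null;
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            int g = e.getKey();
            int u = e.getValue();
            boolean extend = cur != null
                    && g == cur[1] + 1               // next glyph index
                    && u == cur[2] + (g - cur[0])    // next Unicode value
                    && (g & 0xFF) != 0               // assumption: do not cross
                    && (u & 0xFF) != 0;              // a low-byte boundary
            if (extend) {
                cur[1] = g;
            } else {
                // A range is added to the list as soon as it is opened, so the
                // last one needs no sentinel end value such as -1.
                cur = new int[] {g, g, u};
                ranges.add(cur);
            }
        }
        return ranges;
    }

    /** Writes the ranges in beginbfrange sections of at most 100 entries each. */
    static String bfrangeSections(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < ranges.size(); start += MAX_ENTRIES) {
            int end = Math.min(start + MAX_ENTRIES, ranges.size());
            sb.append(end - start).append(" beginbfrange\n");
            for (int[] r : ranges.subList(start, end)) {
                sb.append(String.format("<%04X> <%04X> <%04X>\n", r[0], r[1], r[2]));
            }
            sb.append("endbfrange\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> sample = new TreeMap<>();
        for (int i = 0; i < 6; i++) {
            sample.put(5 + i, 0x41 + i);
        }
        sample.put(20, 0x61);
        System.out.print(bfrangeSections(toRanges(sample)));
    }
}
```

Because every range always has a real end glyph, no FFFFFFFF value can reach the PDF, and the section count prefix is computed per block rather than once for the whole list.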
Comment 18 Bertrand Delacretaz 2006-10-11 02:06:36 UTC
Patch of comment #17 applied in revision 462741, thanks!
Comment 19 Bertrand Delacretaz 2007-04-09 02:18:55 UTC
I think this issue can be closed now, please reopen if I'm wrong
Comment 20 Bertrand Delacretaz 2007-04-09 02:19:20 UTC
*** Bug 40467 has been marked as a duplicate of this bug. ***
Comment 21 Glenn Adams 2012-04-01 07:03:20 UTC
batch transition pre-FOP1.0 resolved+fixed bugs to closed+fixed