PDFBox
  1. PDFBox
  2. PDFBOX-922

True type PDFont subclass only supports WinAnsiEncoding (hardcoded!)

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.1
    • Fix Version/s: 2.0.0
    • Component/s: Writing
    • Labels:
      None
    • Environment:
      JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0

      Description

      PDFBox cannot embed Identity-H or Identity-V type TTF fonts in the PDF it creates, making it impossible to create PDFs in any language apart from English and ones supported in WinAnsiEncoding. This behaviour is caused because method PDTrueTypeFont.loadTTF has hardcoded WinAnsiEncoding inside, and there is no Identity-H or Identity-V Encoding classes provided (to set afterwards via PDFont.setFont() )

      This excludes the following languages plus many others:

      • Greek
      • Bulgarian
      • Swedish
      • Baltic languages
      • Malteze

      The PDF created contains garbled characters and/or squares.

      Simple test case:

                      PDDocument doc = null;
      		try {
      			doc = new PDDocument();
      			PDPage page = new PDPage();
      			doc.addPage(page);
      			// extract fonts for fields
      			byte[] arialNorm = extractFont("arial.ttf");
      			//byte[] arialBold = extractFont("arialbd.ttf"); 
      			//PDFont font = PDType1Font.HELVETICA;
      			PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
      			
      			PDPageContentStream contentStream = new PDPageContentStream(doc, page);
      			contentStream.beginText();
      			contentStream.setFont(font, 12);
      			contentStream.moveTextPositionByAmount(100, 700);
      			contentStream.drawString("Hello world from PDFBox ελληνικά"); // text here may appear garbled; insert any text in Greek or Bulgarian or Malteze
      			contentStream.endText();
      			contentStream.close();
      			doc.save("pdfbox.pdf");
      			System.out.println(" created!");
      		} catch (Exception ioe) {
      			ioe.printStackTrace();
      		} finally {
      			if (doc != null) {
      				try { doc.close(); } catch (Exception e) {}
      			}
      		}
      
      1. pdfbox-unicode2.diff
        14 kB
        Antti Lankila
      2. pdfbox-unicode.diff
        11 kB
        Antti Lankila

        Issue Links

          Activity

          Hide
          Thanos Agelatos added a comment -

          No PDF expert but would the 'reverse' of work done in PDFBOX-654 be sufficient to be able to encode Identity-H in new PDFs?

          Show
          Thanos Agelatos added a comment - No PDF expert but would the 'reverse' of work done in PDFBOX-654 be sufficient to be able to encode Identity-H in new PDFs?
          Hide
          Andreas Lehmkühler added a comment -

          PDFBOX-654 is about text extraction. But you are correct, WinAnsiEncoding is hardcoded inside PDTrueTypeFont. For now PDFBox hasn't any support for Identiy-H as encoding when adding text.

          Show
          Andreas Lehmkühler added a comment - PDFBOX-654 is about text extraction. But you are correct, WinAnsiEncoding is hardcoded inside PDTrueTypeFont. For now PDFBox hasn't any support for Identiy-H as encoding when adding text.
          Hide
          Thanos Agelatos added a comment -

          Andreas,
          thank you for the reply. I assumed that since code affected from PDFBOX-654 does some Identity-H parsing from the PDF the opposite could achieve what is requested here. Anyways, do you have some planning on when this feature will be come available? Limiting to WinAnsi is taking out many of the languages that we want PDFs generated for.

          thanks in advance
          Thanos

          Show
          Thanos Agelatos added a comment - Andreas, thank you for the reply. I assumed that since code affected from PDFBOX-654 does some Identity-H parsing from the PDF the opposite could achieve what is requested here. Anyways, do you have some planning on when this feature will be come available? Limiting to WinAnsi is taking out many of the languages that we want PDFs generated for. thanks in advance Thanos
          Hide
          Wolfgang Glas added a comment -

          I have implemented a glyph extractor, so that subfonts with less than or equal 256 glyphs may be extracted from a large TTF font.

          The code may be found there and is licensed under the terms of the apache licenese:

          http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/impl/TTFSubFont.java

          We use the the code successfully to write PDFs with full-featured unicode strings by splitting the TTF font to smaller subfonts.

          Show
          Wolfgang Glas added a comment - I have implemented a glyph extractor, so that subfonts with less than or equal 256 glyphs may be extracted from a large TTF font. The code may be found there and is licensed under the terms of the apache licenese: http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/impl/TTFSubFont.java We use the the code successfully to write PDFs with full-featured unicode strings by splitting the TTF font to smaller subfonts.
          Hide
          Charlie B added a comment -

          Wow. Great timing .. found it hard to believe that this issue wasn't getting more traction. I recently discovered PDFBox and coded up a prototype for generating docs ...only to find this show stopper.

          Wolfgang could you give a bit more detail on how you use the extractor? Looks like in EntitiesPdfRenderer .. you have a getSubFont(), but I'm not quite sure how to apply the subfonts.

          Also, is there any plan at all to support full TTFs in PDFBox proper?

          Show
          Charlie B added a comment - Wow. Great timing .. found it hard to believe that this issue wasn't getting more traction. I recently discovered PDFBox and coded up a prototype for generating docs ...only to find this show stopper. Wolfgang could you give a bit more detail on how you use the extractor? Looks like in EntitiesPdfRenderer .. you have a getSubFont(), but I'm not quite sure how to apply the subfonts. Also, is there any plan at all to support full TTFs in PDFBox proper?
          Hide
          Andreas Lehmkühler added a comment -

          @Wolfgang
          Sounds good to me, but I can't find any license information neither as header nor somewhere on the website. Can you somehow add those information, just for the record?

          Show
          Andreas Lehmkühler added a comment - @Wolfgang Sounds good to me, but I can't find any license information neither as header nor somewhere on the website. Can you somehow add those information, just for the record?
          Hide
          Wolfgang Glas added a comment -

          Andreas, I've added the apache licensing terms to the TTFSubFont file.

          @Charlie: Inside

          http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/EntitiesPdfRenderer.java

          You find an example on how to construct 256 glyph subfonts out of a large uncode-support font.

          http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/impl/PdfRenderContextImpl.java

          gives you the code to draw a string composed out of multiple unicode blocks.

          Please note, that you have to set the /Length1 property of the embedded TTF font stream. (Therefore I introduced the new PDTrueTypeFont.loadFont(PDStream, Encoding) method, note the PDStream argument...)

          Furthermore, I wrote my own accessor interface to adobe's glyphlist, because the pdfbox API inside pdfbox's Encoding is not optimal. (unicode code points are not represented as int's, no static accessor to the parsed glyph list...)

          And yes, I'd really like to see this integrated into pdfbox, but as I pointed out it will need some finetuning, OpenType support, testing etc...

          My class is geared towards Microsoft's core fonts and not more.

          Best regards, Wolfgang

          Show
          Wolfgang Glas added a comment - Andreas, I've added the apache licensing terms to the TTFSubFont file. @Charlie: Inside http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/EntitiesPdfRenderer.java You find an example on how to construct 256 glyph subfonts out of a large uncode-support font. http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/impl/PdfRenderContextImpl.java gives you the code to draw a string composed out of multiple unicode blocks. Please note, that you have to set the /Length1 property of the embedded TTF font stream. (Therefore I introduced the new PDTrueTypeFont.loadFont(PDStream, Encoding) method, note the PDStream argument...) Furthermore, I wrote my own accessor interface to adobe's glyphlist, because the pdfbox API inside pdfbox's Encoding is not optimal. (unicode code points are not represented as int's, no static accessor to the parsed glyph list...) And yes, I'd really like to see this integrated into pdfbox, but as I pointed out it will need some finetuning, OpenType support, testing etc... My class is geared towards Microsoft's core fonts and not more. Best regards, Wolfgang
          Hide
          Charlie B added a comment -

          @Wolfgang

          Thanks for the pointers ... after spending some time with the code I think I get the patterns - very cool work ... are you willing to share the mods you've made PDFBox (new loadFont method and Encoding changes)?

          Show
          Charlie B added a comment - @Wolfgang Thanks for the pointers ... after spending some time with the code I think I get the patterns - very cool work ... are you willing to share the mods you've made PDFBox (new loadFont method and Encoding changes)?
          Hide
          Wolfgang Glas added a comment -

          Charlie,

          I forgot to mention, that my work is based on the patch attached to PDFBOX-954.
          There you will find the improved loadFont() method and the according Encoding changes.

          Andreas and I have arranged a metting next week, where we will discuss on how to integrate my patch into pdfbox. Furthermore, we will work out a way on how to further improve the TTF-Unicode suppor. Surely, we will report our findings to the mailinglist and open subsequent jira issues as required.

          Wolfgang

          Show
          Wolfgang Glas added a comment - Charlie, I forgot to mention, that my work is based on the patch attached to PDFBOX-954 . There you will find the improved loadFont() method and the according Encoding changes. Andreas and I have arranged a metting next week, where we will discuss on how to integrate my patch into pdfbox. Furthermore, we will work out a way on how to further improve the TTF-Unicode suppor. Surely, we will report our findings to the mailinglist and open subsequent jira issues as required. Wolfgang
          Hide
          Charlie B added a comment -

          Hi Wolfgang, Andreas,

          Wondering how your meeting went. Before I start customizing for TTF-Unicode writing I'd like to know more about any plans for productizing.

          Thanks!

          Show
          Charlie B added a comment - Hi Wolfgang, Andreas, Wondering how your meeting went. Before I start customizing for TTF-Unicode writing I'd like to know more about any plans for productizing. Thanks!
          Hide
          Wolfgang Glas added a comment -

          Hi Charlie,

          Sorry for coming up so late, stuffed with work here...

          Basically, Andreas and I agreed in introducing a unicode-aware showtext-API in pdfbox-2.0. I will announce plans on the mailinglist and create issues likewise, when the dust on my desk settles,

          Best regards, Wolfgang

          Show
          Wolfgang Glas added a comment - Hi Charlie, Sorry for coming up so late, stuffed with work here... Basically, Andreas and I agreed in introducing a unicode-aware showtext-API in pdfbox-2.0. I will announce plans on the mailinglist and create issues likewise, when the dust on my desk settles, Best regards, Wolfgang
          Hide
          Charlie B added a comment -

          Hi Wolfgang, Andreas,

          Again I'm wondering if you have any solid plans for unicode text API in the near future?

          Thanks for any info or ETA on 2.0.

          Show
          Charlie B added a comment - Hi Wolfgang, Andreas, Again I'm wondering if you have any solid plans for unicode text API in the near future? Thanks for any info or ETA on 2.0.
          Hide
          Dinko Ivanov added a comment -

          Hello Andreas,

          We need to export Cyrillic content in PDF files. We've already invested significant effort in facilitating PDFBox for our needs and would like to somehow workaround this problem.
          Do you have any update on plans for including this feature in PDFBox?

          @Wolfgang: Could you share some more details/basic steps on how the solution in Sketch framework could be reused?
          I tried a simple scenario (export Drawing containing Cyrillic symbols to PDF), but without success. I think I'm missing something.

          Thanks and regards,
          Dinko

          Show
          Dinko Ivanov added a comment - Hello Andreas, We need to export Cyrillic content in PDF files. We've already invested significant effort in facilitating PDFBox for our needs and would like to somehow workaround this problem. Do you have any update on plans for including this feature in PDFBox? @Wolfgang: Could you share some more details/basic steps on how the solution in Sketch framework could be reused? I tried a simple scenario (export Drawing containing Cyrillic symbols to PDF), but without success. I think I'm missing something. Thanks and regards, Dinko
          Hide
          Charlie B added a comment -

          Hi gang,

          Any update here? Even a very loose idea of timing would be very helpful.

          Thanks,

          • Charlie
          Show
          Charlie B added a comment - Hi gang, Any update here? Even a very loose idea of timing would be very helpful. Thanks, Charlie
          Hide
          Wolfgang Glas added a comment -

          Hi Charlie,

          I have bad news for you. We have an an enormous struggle to get our projects done this year and I really do not have capacities to dive into pdfbox any deeper. I can answer any kind of questions, if somebody steps up to do the implementation of a wider unicode support in pdfbox's writing API, but I cannot do the implementation and testing, sorry for that.

          I'd really love to come to ApacheCon in Sinsheim, but we do need our projects done, that's life

          Best regards, Wolfgang

          Show
          Wolfgang Glas added a comment - Hi Charlie, I have bad news for you. We have an an enormous struggle to get our projects done this year and I really do not have capacities to dive into pdfbox any deeper. I can answer any kind of questions, if somebody steps up to do the implementation of a wider unicode support in pdfbox's writing API, but I cannot do the implementation and testing, sorry for that. I'd really love to come to ApacheCon in Sinsheim, but we do need our projects done, that's life Best regards, Wolfgang
          Hide
          Andreas Lehmkühler added a comment -

          As a first step I added TTFSubFont support in revision 1413777 based on Wolfgang Glas code.

          Show
          Andreas Lehmkühler added a comment - As a first step I added TTFSubFont support in revision 1413777 based on Wolfgang Glas code.
          Hide
          Antti Lankila added a comment - - edited

          I'm no expert with PDF, but I looked into the problem yesterday and this morning, and came up with this.

          Candidate specification for Unicode text writing support

          1. Each TTF font, when loaded, will be embedded as stream in the document. Two font descriptors will be created per call:

          • TTF descriptor itself
          • CIDFont Type 2 descriptor, which will be referenced by TTF

          2. CMap maps from character code to character id (CID). COSString will write unicode strings when required, and it's probably simplest if the CIDs are also just unicode codepoints.

          • Encoding will be Identity-H.
          • To support copy-paste, the ToUnicode table needs to be provided, and is also identity map.

          3. Character id is mapped to glyph id (GID). There are actually two major CIDFont types:

          • CIDFont Type 0: contains CFF or OpenType fonts that have intrinsic CID->glyph mapping.
            • this presumably means that the CIDs are font-specific and therefore CMap must be read from a font to support Type 0.
          • CIDFont Type 2: contains TrueType fonts which must have a CIDToGIDMap that declares how to map from CID to GID.
            • TTF files will probably have a Windows platform Unicode encoding, which is the unicode codepoint to glyph id map, and thus the CIDToGIDMap we must write. The map can be streamed and compressed and should not take much space.
          Consequences of the design
          • PDF as a document will be remarkably readable, though COSString tends to use hexadecimal format way too often. (Bug to be fixed? I feel that COSString should be based on chars (e.g. StringBuilder), not bytes (ByteArrayOutputStream).)
          • design is relatively simple; the hard work will be writing the CIDToGIDMap table, but this will be based on the Windows Unicode encoding table in TTF and should be trivial to generate.
          • fonts will have all of their characters embedded in the PDF

          I can't promise when I have time to implement this, but as far as I understand it, something like this is what it takes.

          Show
          Antti Lankila added a comment - - edited I'm no expert with PDF, but I looked into the problem yesterday and this morning, and came up with this. Candidate specification for Unicode text writing support 1. Each TTF font, when loaded, will be embedded as stream in the document. Two font descriptors will be created per call: TTF descriptor itself CIDFont Type 2 descriptor, which will be referenced by TTF 2. CMap maps from character code to character id (CID). COSString will write unicode strings when required, and it's probably simplest if the CIDs are also just unicode codepoints. Encoding will be Identity-H. To support copy-paste, the ToUnicode table needs to be provided, and is also identity map. 3. Character id is mapped to glyph id (GID). There are actually two major CIDFont types: CIDFont Type 0: contains CFF or OpenType fonts that have intrinsic CID->glyph mapping. this presumably means that the CIDs are font-specific and therefore CMap must be read from a font to support Type 0. CIDFont Type 2: contains TrueType fonts which must have a CIDToGIDMap that declares how to map from CID to GID. TTF files will probably have a Windows platform Unicode encoding, which is the unicode codepoint to glyph id map, and thus the CIDToGIDMap we must write. The map can be streamed and compressed and should not take much space. Consequences of the design PDF as a document will be remarkably readable, though COSString tends to use hexadecimal format way too often. (Bug to be fixed? I feel that COSString should be based on chars (e.g. StringBuilder), not bytes (ByteArrayOutputStream).) design is relatively simple; the hard work will be writing the CIDToGIDMap table, but this will be based on the Windows Unicode encoding table in TTF and should be trivial to generate. fonts will have all of their characters embedded in the PDF I can't promise when I have time to implement this, but as far as I understand it, something like this is what it takes.
          Hide
          John Hewson added a comment -

          Antti, you're along the right lines, it's actually simpler than that. You don't need a CIDToGIDMap, it can just be the name Identity, in which case CID = GID for all characters in the font, which means you can use the raw GIDs in the ToUnicode map and in COSStrings when they are written to the content stream.

          You'll need to subset the TTF fonts because they're usually quite large, but PDFBox has some code for doing this already.

          Show
          John Hewson added a comment - Antti, you're along the right lines, it's actually simpler than that. You don't need a CIDToGIDMap , it can just be the name Identity , in which case CID = GID for all characters in the font, which means you can use the raw GIDs in the ToUnicode map and in COSStrings when they are written to the content stream. You'll need to subset the TTF fonts because they're usually quite large, but PDFBox has some code for doing this already.
          Hide
          Antti Lankila added a comment -

          OK, but the fact is that I'd really prefer the COSString to be true unicode. That should be easy to interpret. and matches what the text encoding notionally should be (UTF-16BE when starting with the byte sequence 0xfe 0xff.)

          Show
          Antti Lankila added a comment - OK, but the fact is that I'd really prefer the COSString to be true unicode. That should be easy to interpret. and matches what the text encoding notionally should be (UTF-16BE when starting with the byte sequence 0xfe 0xff.)
          Hide
          John Hewson added a comment - - edited

          It shouldn't make any difference - the ToUnicode map defines the mapping to unicode, not the character codes embedded in the content stream. Of course there's no harm in using UTF-16 (when possible, see below) instead of PDFDocEncoding, but be aware that PDF readers use the ToUnicode map, as long as it's present.

          I should add that it isn't always possible to use Unicode for glyph encoding, because not every glyph has a a unique unicode point. For example, a font may include a set of normal characters and a set of small caps characters, but only one of these can map to the unicode "A" character. The other is forced to map to some other code, which is why GIDs are typically used with TrueType fonts, because we can guarantee that each glyph has a unique GID and the ToUnicode map can be used to map both the normal "A" and small cap "A" to Unicode "A".

          In other words: Unicode code point != glyph

          Show
          John Hewson added a comment - - edited It shouldn't make any difference - the ToUnicode map defines the mapping to unicode, not the character codes embedded in the content stream. Of course there's no harm in using UTF-16 (when possible, see below) instead of PDFDocEncoding, but be aware that PDF readers use the ToUnicode map, as long as it's present. I should add that it isn't always possible to use Unicode for glyph encoding, because not every glyph has a a unique unicode point. For example, a font may include a set of normal characters and a set of small caps characters, but only one of these can map to the unicode "A" character. The other is forced to map to some other code, which is why GIDs are typically used with TrueType fonts, because we can guarantee that each glyph has a unique GID and the ToUnicode map can be used to map both the normal "A" and small cap "A" to Unicode "A". In other words: Unicode code point != glyph
          Hide
          Antti Lankila added a comment -

          Well it makes the difference that when you construct a COSString, the default approach is to either render it as unicode or ascii. So UTF-16BE seems like the path of least resistance, not to mention that I like it for the reason that it's a defined standard and should follow the principle of least astonishment. As I mention above, I'm not very happy about COSString. I think it should be based on some character abstraction, rather than byte stream.

          Show
          Antti Lankila added a comment - Well it makes the difference that when you construct a COSString, the default approach is to either render it as unicode or ascii. So UTF-16BE seems like the path of least resistance, not to mention that I like it for the reason that it's a defined standard and should follow the principle of least astonishment. As I mention above, I'm not very happy about COSString. I think it should be based on some character abstraction, rather than byte stream.
          Hide
          Antti Lankila added a comment -

          OK. I got something that probably qualifies for the worst possible implementation of Unicode text writing in a PDF generation library in the entire history of mankind. Consider this an early preview.

          All that matters is that I did see this pile of garbage spit out unicode text when used with TTF font that has Windows platform Unicode encoding CMAP table.

          To use it:

          • PDType0Font.loadTTF() is a new method that generates a Type0 font with CIDFont Type2 hanging from it. The old TrueTypeFont.loadTTF is still usable, but you won't get Unicode text capabilities.
          • PDContentStream has a new method, drawUnicodeString(), which must be used when drawing text using a CID font. This generates the required 16-bit strings into the document.

          It turns out that whenever a CID font is used, all text strings meant to be printed will be read as 16-bit big-endian values. So there's no point to mess with PDFDocEncoding or UTF-16BE COSString or any of that stuff – drawing strings on page is a fundamentally special operation which depends entirely on the font being used.

          IMHO, the PDPageContentStream drawString should always be provided with the font that is currently being used for drawing so it could ask that font for instructions on how to correctly express the various glyphs.

          Show
          Antti Lankila added a comment - OK. I got something that probably qualifies for the worst possible implementation of Unicode text writing in a PDF generation library in the entire history of mankind. Consider this an early preview. All that matters is that I did see this pile of garbage spit out unicode text when used with TTF font that has Windows platform Unicode encoding CMAP table. To use it: PDType0Font.loadTTF() is a new method that generates a Type0 font with CIDFont Type2 hanging from it. The old TrueTypeFont.loadTTF is still usable, but you won't get Unicode text capabilities. PDContentStream has a new method, drawUnicodeString(), which must be used when drawing text using a CID font. This generates the required 16-bit strings into the document. It turns out that whenever a CID font is used, all text strings meant to be printed will be read as 16-bit big-endian values. So there's no point to mess with PDFDocEncoding or UTF-16BE COSString or any of that stuff – drawing strings on page is a fundamentally special operation which depends entirely on the font being used. IMHO, the PDPageContentStream drawString should always be provided with the font that is currently being used for drawing so it could ask that font for instructions on how to correctly express the various glyphs.
          Hide
          Antti Lankila added a comment - - edited

          Version 2, more palatable.

          This one uses Identity-H for charcode -> CID, and Identity for CID -> GID, and then has a few hacks in PDFont (encodeCID never has worked as far as I can tell) and PDType0Font to make it work.

          I would have liked to use the fontbox's CMap facility to do the codepoint -> CID conversion, but I could not work out how the CMap stuff works. The method lookupCID() is a CID -> String conversion, apparently, lookup(byte[], int, int) does the reverse but it goes into some CodespaceRange check that is probably not 8-bit clean. I just gave up trying to figure out what this is supposed to be doing and just did a hashmap in PDType0Font to do the String->CID conversion.

          There are probably other issues remaining, like PDFont's getStringWidth() starts out via conversion to ISO-8859-1, which can't be correct.

          Show
          Antti Lankila added a comment - - edited Version 2, more palatable. This one uses Identity-H for charcode -> CID, and Identity for CID -> GID, and then has a few hacks in PDFont (encodeCID never has worked as far as I can tell) and PDType0Font to make it work. I would have liked to use the fontbox's CMap facility to do the codepoint -> CID conversion, but I could not work out how the CMap stuff works. The method lookupCID() is a CID -> String conversion, apparently, lookup(byte[], int, int) does the reverse but it goes into some CodespaceRange check that is probably not 8-bit clean. I just gave up trying to figure out what this is supposed to be doing and just did a hashmap in PDType0Font to do the String->CID conversion. There are probably other issues remaining, like PDFont's getStringWidth() starts out via conversion to ISO-8859-1, which can't be correct.
          Hide
          John Hewson added a comment - - edited

          If you're using Identity-H for charcode -> CID and Identity for CID -> GID you're not going to be able to subset the font because doing so will change the GIDs and therefore the CIDs and therefore the charcodes (which will also no longer be the correct Unicode points).

          Show
          John Hewson added a comment - - edited If you're using Identity-H for charcode -> CID and Identity for CID -> GID you're not going to be able to subset the font because doing so will change the GIDs and therefore the CIDs and therefore the charcodes (which will also no longer be the correct Unicode points).
          Hide
          Antti Lankila added a comment -

          I do not really understand what makes you say that. Isn't subsetted font basically just a wholly different font file, just having a bunch of glyphs removed from the original one? For instance, assuming it is a TTF file, you drop bunch of glyphs and then update the cmaps to reference the appropriate glyph indexes, and then you have a new TTF file. If so, I can't see the problem because you are providing all the same information as with the original font, only with less glyphs included.

          On the other hand, I do understand that if you write the text stream using encoding of one font, then change the definition of the TTF font without re-encoding the text, then you definitely run into problems. But the only possible way to keep CID stable is to define a standard for them, such as that CIDs are UCS-2. This can be done, but as far as I can tell this limits code points to the less than 0x10000 range because CID font writing writes 16 bit character indexes by definition, and there is no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but it's nevertheless a limitation that the identity mapping for glyph indexes does not have. The only limitation of the latter approach is that single font can't have more than 65536 glyphs.

          BTW, I've been quiet on this front because I solved my immediate problem by switching to a PDF rendering library called jPod. It's not so advanced as pdfbox, and it didn't support unicode text either, but it was possible to get CID keyed fonts to work on it without touching the library itself, just through providing appropriate COS objects and setting up an encoding based on the font's Windows Unicode cmap. I even managed to set up a working copypaste by providing the ToUnicode postscript program, so I got everything working nicely using that 2008-era library, but I had to write most of the PDF object factories myself.

          Show
          Antti Lankila added a comment - I do not really understand what makes you say that. Isn't subsetted font basically just a wholly different font file, just having a bunch of glyphs removed from the original one? For instance, assuming it is a TTF file, you drop bunch of glyphs and then update the cmaps to reference the appropriate glyph indexes, and then you have a new TTF file. If so, I can't see the problem because you are providing all the same information as with the original font, only with less glyphs included. On the other hand, I do understand that if you write the text stream using encoding of one font, then change the definition of the TTF font without re-encoding the text, then you definitely run into problems. But the only possible way to keep CID stable is to define a standard for them, such as that CIDs are UCS-2. This can be done, but as far as I can tell this limits code points to the less than 0x10000 range because CID font writing writes 16 bit character indexes by definition, and there is no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but it's nevertheless a limitation that the identity mapping for glyph indexes does not have. The only limitation of the latter approach is that single font can't have more than 65536 glyphs. BTW, I've been quiet on this front because I solved my immediate problem by switching to a PDF rendering library called jPod. It's not so advanced as pdfbox, and it didn't support unicode text either, but it was possible to get CID keyed fonts to work on it without touching the library itself, just through providing appropriate COS objects and setting up an encoding based on the font's Windows Unicode cmap. I even managed to set up a working copypaste by providing the ToUnicode postscript program, so I got everything working nicely using that 2008-era library, but I had to write most of the PDF object factories myself.
          Hide
          Antti Lankila added a comment - - edited

          Anyway, let's take a look at the changes required in PDFBox to get the text writing to work properly.

          • drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly. Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for "public byte[] encode(String)". I would suggest encoding displayable text always as (<hex chars>) sequences because this encoding is simplest to implement and the easiest to make bug free.
          • PDFont needs a clearly specified API which performs java String to font-specific encoding transformation. The process is usually called encoding, and yields a byte array, and the reverse process of taking a byte array and interpreting it to String is called decoding. Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually do, because everything seems to be called "encode" or "lookup". It seems that the encode(byte[], int int) performs decoding, so it should be renamed such. In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte sequences required by that font and java Strings, and they handle full runs of text, not just single characters. They will then use single- or multibyte encodings as the font requires without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine.
          • When implementing encoding, never ask for the char[] array of a Java String. Instead, "for (int i = 0, cp; i < string.length(); i += Character.charCount(cp)) { cp = string.codePointAt(i); ... now encode the codepoint ... }

            ". This will handle the UTF-16 surrogate pairs correctly.

          • There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These are to be solved in a case-by-case basis in the PDFont hierarchy. The PDFont highest class methods for encode and decode should be defined as abstract to reflect the fact that encoding depends on the particular subtype of the font. It seems that Type1, TrueType, Type3, and CIDType0 and CIDType2 fonts require different handling from each other. It may be that for some of these fonts the implementation is same because the actual mechanics can be handled by varying the Encoding instance, though.
          Show
          Antti Lankila added a comment - - edited Anyway, let's take a look at the changes required in PDFBox to get the text writing to work properly. drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly. Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for "public byte[] encode(String)". I would suggest encoding displayable text always as (<hex chars>) sequences because this encoding is simplest to implement and the easiest to make bug free. PDFont needs a clearly specified API which performs java String to font-specific encoding transformation. The process is usually called encoding, and yields a byte array, and the reverse process of taking a byte array and interpreting it to String is called decoding. Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually do, because everything seems to be called "encode" or "lookup". It seems that the encode(byte[], int int) performs decoding, so it should be renamed such. In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte sequences required by that font and java Strings, and they handle full runs of text, not just single characters. They will then use single- or multibyte encodings as the font requires without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine. When implementing encoding, never ask for the char[] array of a Java String. Instead, "for (int i = 0, cp; i < string.length(); i += Character.charCount(cp)) { cp = string.codePointAt(i); ... now encode the codepoint ... } ". This will handle the UTF-16 surrogate pairs correctly. There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These are to be solved in a case-by-case basis in the PDFont hierarchy. The PDFont highest class methods for encode and decode should be defined as abstract to reflect the fact that encoding depends on the particular subtype of the font. It seems that Type1, TrueType, Type3, and CIDType0 and CIDType2 fonts require different handling from each other. It may be that for some of these fonts the implementation is same because the actual mechanics can be handled by varying the Encoding instance, though.
          Hide
          John Hewson added a comment - - edited

          I do not really understand what makes you say that. Isn't subsetted font basically just a wholly different font file, just having a bunch of glyphs removed from the original one? For instance, assuming it is a TTF file, you drop bunch of glyphs and then update the cmaps to reference the appropriate glyph indexes, and then you have a new TTF file. If so, I can't see the problem because you are providing all the same information as with the original font, only with less glyphs included.

          You said that you were using "Identity-H for charcode -> CID, and Identity for CID -> GID", which doesn't involve updating any cmaps. If you remove glyphs from a font then the GIDs will change, and if you're using an Identity cmap then your CIDs will by definition change also. But now you mention "update the cmaps", which isn't going to be an Identity cmap any more... so actually you're not wanting to use an Identity cmap.

          On the other hand, I do understand that if you write the text stream using encoding of one font, then change the definition of the TTF font without re-encoding the text, then you definitely run into problems. But the only possible way to keep CID stable is to define a standard for them, such as that CIDs are UCS-2

          Not necessarily, you could use a CIDToGIDMap which initially is an identity mapping but which is updated to reflect the new GIDs once the font is subset - that's a pretty good approach.

          This can be done, but as far as I can tell this limits code points to the less than 0x10000 range because CID font writing writes 16 bit character indexes by definition, and there is no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but it's nevertheless a limitation that the identity mapping for glyph indexes does not have. The only limitation of the latter approach is that single font can't have more than 65536 glyphs.

          You had said that you wanted to use "identity CID -> GID" but you're going to need a font with tens of thousands of empty glyphs in order to have that CID also be a valid Unicode point... not what you want.

          Show
          John Hewson added a comment - - edited I do not really understand what makes you say that. Isn't subsetted font basically just a wholly different font file, just having a bunch of glyphs removed from the original one? For instance, assuming it is a TTF file, you drop bunch of glyphs and then update the cmaps to reference the appropriate glyph indexes, and then you have a new TTF file. If so, I can't see the problem because you are providing all the same information as with the original font, only with less glyphs included. You said that you were using "Identity-H for charcode -> CID, and Identity for CID -> GID", which doesn't involve updating any cmaps. If you remove glyphs from a font then the GIDs will change, and if you're using an Identity cmap then your CIDs will by definition change also. But now you mention "update the cmaps", which isn't going to be an Identity cmap any more... so actually you're not wanting to use an Identity cmap. On the other hand, I do understand that if you write the text stream using encoding of one font, then change the definition of the TTF font without re-encoding the text, then you definitely run into problems. But the only possible way to keep CID stable is to define a standard for them, such as that CIDs are UCS-2 Not necessarily, you could use a CIDToGIDMap which initially is an identity mapping but which is updated to reflect the new GIDs once the font is subset - that's a pretty good approach. This can be done, but as far as I can tell this limits code points to the less than 0x10000 range because CID font writing writes 16 bit character indexes by definition, and there is no notion of the surrogate pairs of UTF-16. It might not be a real problem in practice, but it's nevertheless a limitation that the identity mapping for glyph indexes does not have. The only limitation of the latter approach is that single font can't have more than 65536 glyphs. You had said that you wanted to use "identity CID -> GID" but you're going to need a font with tens of thousands of empty glyphs in order to have that CID also be a valid Unicode point... not what you want.
          Hide
          John Hewson added a comment - - edited

          drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly.

          Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, all) will fit inside 8 bits. The byte-width of a string is controlled by whether or not it starts with a BOM, not which font it uses the current font's CMap but is always 16-bits with TTF.

          Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for "public byte[] encode(String)".

          drawString() is only valid after setFont() has been called, so it doesn't need adding to the API, we can just use the current font. PDFont#encode is a good idea, yes.

          PDFont needs a clearly specified API which performs java String to font-specific encoding transformation.

          Yes, as above.

          Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually do, because everything seems to be called "encode" or "lookup". It seems that the encode(byte[], int int) performs decoding, so it should be renamed such.

          Yes, I don't know if anybody knows what those methods are actually doing, including the original author.

          In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte sequences required by that font and java Strings, and they handle full runs of text, not just single characters. They will then use single- or multibyte encodings as the font requires without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine.

          processEncodedText() is indeed crazy and needs fixing, but what you propose won't work because the 16-bit string encoding is not set by the font, it's set on a per-string basis by having that string start with a BOM.

          There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These are to be solved in a case-by-case basis in the PDFont hierarchy. The PDFont highest class methods for encode and decode should be defined as abstract to reflect the fact that encoding depends on the particular subtype of the font.

          Yes, though as far as decoding the correct text is concerned all you have to do is make sure that the ToUnicode map is built correctly - you can put any old garbage in the actual strings (any many PDFs do).

          It may be that for some of these fonts the implementation is same because the actual mechanics can be handled by varying the Encoding instance, though.

          Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. Type1C) only.

          Show
          John Hewson added a comment - - edited drawString() in PDPageContentStream just writes the text into PDF as any COSString would choose to represent it. This is not the right thing to do. When the font is a CID keyed font, every glyph is 16 bit wide by definition, and COSString won't necessarily notice and write it correctly. Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, all) will fit inside 8 bits. The byte-width of a string is controlled by whether or not it starts with a BOM, not which font it uses the current font's CMap but is always 16-bits with TTF. Therefore, drawString() must know what font is currently being drawn, and ask that font to encode the String to whatever byte sequence it takes to draw those glyphs. So, PDFont must be added to the drawString() API, and PDFont ought to have a method for "public byte[] encode(String)". drawString() is only valid after setFont() has been called, so it doesn't need adding to the API, we can just use the current font. PDFont#encode is a good idea, yes. PDFont needs a clearly specified API which performs java String to font-specific encoding transformation. Yes, as above. Observe that there are no methods in PDFont called decode(), and I have a hard time figuring out what any one of these methods actually do, because everything seems to be called "encode" or "lookup". It seems that the encode(byte[], int int) performs decoding, so it should be renamed such. Yes, I don't know if anybody knows what those methods are actually doing, including the original author. In general I'd recommend pushing the encode/decode job down to the font layer. Provide just two methods: "byte[] encode(String)" and "String decode(byte[])". Their job is to convert between the byte sequences required by that font and java Strings, and they handle full runs of text, not just single characters. They will then use single- or multibyte encodings as the font requires without the higher level having to do crazy stuff like processEncodedText() currently does in PDFStreamEngine. processEncodedText() is indeed crazy and needs fixing, but what you propose won't work because the 16-bit string encoding is not set by the font, it's set on a per-string basis by having that string start with a BOM. There are unfortunately very many ways to encode text in PDF, and especially if text needs to be decodable from the byte stream generated by other programs, the full complexity must be faced and implemented. These are to be solved in a case-by-case basis in the PDFont hierarchy. The PDFont highest class methods for encode and decode should be defined as abstract to reflect the fact that encoding depends on the particular subtype of the font. Yes, though as far as decoding the correct text is concerned all you have to do is make sure that the ToUnicode map is built correctly - you can put any old garbage in the actual strings (any many PDFs do). It may be that for some of these fonts the implementation is same because the actual mechanics can be handled by varying the Encoding instance, though. Maybe, though the Encoding class is for Type1 fonts (and equivalent, e.g. Type1C) only.
          Hide
          Antti Lankila added a comment -

          Going to combine two posts into one...

          "You said that you were using "Identity-H for charcode -> CID, and Identity for CID -> GID", which doesn't involve updating any cmaps."

          Ah. I meant the cmap table in TTF actually. They do have cmaps which map from some specific encoding's values to glyph indexes. I can understand that my phrasing was confusing.

          Full ack on the CIDToGIDMap approach. That is a way to allow manipulating a font without having to re-encode text already written with the font.

          There must be some confusion about the 0x10000 CID limit. I simply meant that assuming a font contains a glyph which has unicode codepoint above 0x10000, it follows that rendering that glyph requires the CIDs to not be treated as UCS-2 values, because there is no way to represent that codepoint in UCS-2. I was mostly trying to weigh between different alternatives. I still like identity mappings because that means that conversion from unicode to appropriate GID is the simplest possible, at least for TTF fonts with Windows Unicode cmap table.

          On to the next one...

          "Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, all) will fit inside 8 bits. The byte-width of a string is controlled by whether or not it starts with a BOM, not which font it uses."

          In my experience this is not the case. I know the standard says that PDF String encoding is controlled by a BOM appearing at the beginning, but this probably refers to other kinds of text, not the kind of text you print on a page! For instance, according to my testing, if you actually write text in CID keyed font, your BOM will be treated as CID and mapped to a character – or if you try to write with a font that is defined to have 8-bit characters, prepending it with a BOM just generates the BOM's characters in the text. It was this latter behavior that I spotted originally – I tried to generate the three dots ("…") character with PDFont.HELVETICA, and saw the BOM characters appear in the text string, along with extra spaces between glyphs that were the null bytes in UTF-16 encoding.

          Show
          Antti Lankila added a comment - Going to combine two posts into one... "You said that you were using "Identity-H for charcode -> CID, and Identity for CID -> GID", which doesn't involve updating any cmaps." Ah. I meant the cmap table in TTF actually. They do have cmaps which map from some specific encoding's values to glyph indexes. I can understand that my phrasing was confusing. Full ack on the CIDToGIDMap approach. That is a way to allow manipulating a font without having to re-encode text already written with the font. There must be some confusion about the 0x10000 CID limit. I simply meant that assuming a font contains a glyph which has unicode codepoint above 0x10000, it follows that rendering that glyph requires the CIDs to not be treated as UCS-2 values, because there is no way to represent that codepoint in UCS-2. I was mostly trying to weigh between different alternatives. I still like identity mappings because that means that conversion from unicode to appropriate GID is the simplest possible, at least for TTF fonts with Windows Unicode cmap table. On to the next one... "Not quite: every CID can be up to 16-bits wide, but many (or for < 256 glyphs, all) will fit inside 8 bits. The byte-width of a string is controlled by whether or not it starts with a BOM, not which font it uses." In my experience this is not the case. I know the standard says that PDF String encoding is controlled by a BOM appearing at the beginning, but this probably refers to other kinds of text, not the kind of text you print on a page! For instance, according to my testing, if you actually write text in CID keyed font, your BOM will be treated as CID and mapped to a character – or if you try to write with a font that is defined to have 8-bit characters, prepending it with a BOM just generates the BOM's characters in the text. It was this latter behavior that I spotted originally – I tried to generate the three dots ("…") character with PDFont.HELVETICA, and saw the BOM characters appear in the text string, along with extra spaces between glyphs that were the null bytes in UTF-16 encoding.
          Hide
          John Hewson added a comment - - edited

          I meant the cmap table in TTF actually. They do have cmaps which map from some specific encoding's values to glyph indexes. I can understand that my phrasing was confusing.

          Ok, that makes more sense! When the font is subset the cmap table will get rewritten, but that's not going to be a problem. It's basically internal to the font.

          There must be some confusion about the 0x10000 CID limit. I simply meant that assuming a font contains a glyph which has unicode codepoint above 0x10000, it follows that rendering that glyph requires the CIDs to not be treated as UCS-2 values, because there is no way to represent that codepoint in UCS-2. I was mostly trying to weigh between different alternatives. I still like identity mappings because that means that conversion from unicode to appropriate GID is the simplest possible, at least for TTF fonts with Windows Unicode cmap table.

          Perhaps we're making the same observation: that CIDs can't be used to represent all Unicode points, so identity mapping breaks at some point. The reason you can't really do an identity mapping to GID is that GID is the index of the glyph in the font, so if you had a font with a single Unicode character, say U+2265, you'd need 8,804 empty glyphs in the font prior to it. You can however do an identity mapping if you are willing to use GIDs in your strings but you'd need to re-encode your strings after subsetting the font in order to do this, which is a major hassle.

          I know the standard says that PDF String encoding is controlled by a BOM appearing at the beginning, but this probably refers to other kinds of text, not the kind of text you print on a page! For instance, according to my testing, if you actually write text in CID keyed font, your BOM will be treated as CID and mapped to a character – or if you try to write with a font that is defined to have 8-bit characters, prepending it with a BOM just generates the BOM's characters in the text. It was this latter behavior that I spotted originally – I tried to generate the three dots ("…") character with PDFont.HELVETICA, and saw the BOM characters appear in the text string, along with extra spaces between glyphs that were the null bytes in UTF-16 encoding.

          Yeah, looking at the spec you're right that the BOM doesn't apply to content stream text - I hadn't realised that. However, it seems that composite fonts can use encodings that are not fixed to 16 bits:

          "When the current font is composite, the text-showing operators shall behave differently than with simple fonts. For simple fonts, each byte of a string to be shown selects one glyph, whereas for composite fonts, a sequence of one or more bytes are decoded to select a glyph from the descendant CIDFont."

          It looks like the (PDF) CMap controls the code length:

          "The codespace ranges in the CMap (delimited by begincodespacerange and endcodespacerange) specify how many bytes are extracted from the string for each successive character code. A codespace range shall be specified by a pair of codes of some particular length giving the lower and upper bounds of that range. A code shall be considered to match the range if it is the same length as the bounding codes and the value of each of its bytes lies between the corresponding bytes of the lower and upper bounds. The code length shall not be greater than 4."

          I guess we just always generate 16-bit CMaps for composite fonts and be done with it.
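
          For reference, a fixed 16-bit code space is a single-range declaration in an embedded CMap. An illustrative fragment, not actual PDFBox output:

              1 begincodespacerange
              <0000> <FFFF>
              endcodespacerange

          A single two-byte range like this consumes every character code two bytes at a time, which is how the predefined Identity-H CMap behaves.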

          Antti Lankila added a comment -

          Ah... there are multiple ways to understand what "identity mapping" means. I've been using it in the sense that the PDF standard uses: Identity means f(x) = x, which implies that once CIDToGIDMap is Identity and the Encoding is Identity-H, all the character codes and CIDs are just GIDs. When I discuss the possibility that CID values would be constrained to be valid Unicode code points, I use phrasing such as "CIDs are UCS-2". In that case, of course, we would still have an Identity-H mapping at the character code -> CID layer, but not at the CID -> GID layer.
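
          To make the two identity layers concrete, here is an abbreviated sketch of the dictionaries involved (names illustrative; "..." stands for the remaining required entries such as /CIDSystemInfo and /FontDescriptor):

              << /Type /Font
                 /Subtype /Type0
                 /BaseFont /SomeFont
                 /Encoding /Identity-H          % character code -> CID
                 /DescendantFonts [ <<
                     /Type /Font
                     /Subtype /CIDFontType2
                     /CIDToGIDMap /Identity     % CID -> GID
                     ... >> ]
              >>

          With both layers set to identity, the two-byte codes in the content stream are the TrueType glyph indices themselves.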

          I believe that subsetting fonts is not a problem as long as the subsetting is not done after the fact by replacing the FontFile parameter. (Or if it is, then a CIDToGIDMap matching the new glyph IDs must be provided, as you pointed out.)

          Of course, this only applies to TrueType fonts. Some font types apparently define CIDs to have a particular meaning, and they come with their own CID-to-GID programs. I assume such fonts also provide a meaning for each CID that we could use, such as the Unicode value or PostScript name for the CID, or some predefined encoding map that defines all valid CIDs and their interpretation.

          You are right that the CMap will control the code length. I also can't see any good reason to generate anything but 16-bit codes – all that matters is that every glyph can be indexed, and I'm going to guess that there are no non-composite fonts with more than 65536 glyphs, so the generating side stays simple. However, existing PDF files could combine single-byte and multi-byte CMaps. Such CMaps must leave no ambiguity about which codespace range applies, so the ranges for 8-bit codes can't be prefixes of the 16-bit codes, and so on. That is rather complicated, and I doubt the current code (which is also pretty ugly to look at) handles it correctly – the CodespaceRanges are not sorted by length as far as I can see.
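
          A minimal sketch of the length-based matching the spec describes (a hypothetical class, not the current PDFBox CMap code):

              // Hypothetical: a codespace range matches a code only if the code
              // has the same byte length and each byte lies within the bounds.
              final class CodespaceRange
              {
                  private final byte[] low, high;   // bounding codes, equal length

                  CodespaceRange(byte[] low, byte[] high)
                  {
                      this.low = low;
                      this.high = high;
                  }

                  boolean matches(byte[] code)
                  {
                      if (code.length != low.length)
                      {
                          return false;   // the length must match first
                      }
                      for (int i = 0; i < code.length; i++)
                      {
                          int b = code[i] & 0xFF;
                          if (b < (low[i] & 0xFF) || b > (high[i] & 0xFF))
                          {
                              return false;
                          }
                      }
                      return true;
                  }
              }

          A reader then has to try candidate lengths of 1 to 4 bytes at each string position, which is exactly where ambiguous mixed-length code spaces become dangerous.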

          John Hewson added a comment -

          Yep, that's fine.

          Philip Helger added a comment -

          Hi!
          So you had a long discussion on the details. Is there any planned date for adding an implementation to PDFBox?
          Thanks, Philip

          John Hewson added a comment - edited

          Yes, it's planned, but there is no date. Currently the parsing/rendering aspects of PDFBox are taking up most of the committers' time, so this issue will move rather slowly.

          Now that PDFBOX-2262 and PDFBOX-2149 are complete, the main pieces are in place. Embedding TrueType fonts will usually involve subsetting, which FontBox's TTFSubsetter should be able to do, but this is untested. We will also need some way to track which glyphs have been written to the document.
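
          A rough sketch of the tracking side, assuming TTFSubsetter ends up exposing something like add(int) and writeToStream(OutputStream) (unverified, as said above):

              import java.io.ByteArrayOutputStream;
              import java.io.IOException;
              import java.util.Set;
              import java.util.TreeSet;
              import org.apache.fontbox.ttf.TTFSubsetter;
              import org.apache.fontbox.ttf.TrueTypeFont;

              class GlyphTracker
              {
                  private final Set<Integer> usedCodePoints = new TreeSet<Integer>();

                  // call for every string shown with the font
                  void track(String text)
                  {
                      for (int i = 0; i < text.length(); )
                      {
                          int cp = text.codePointAt(i);
                          usedCodePoints.add(cp);
                          i += Character.charCount(cp);
                      }
                  }

                  // at save time, write a subset containing only the used glyphs
                  byte[] subset(TrueTypeFont ttf) throws IOException
                  {
                      TTFSubsetter subsetter = new TTFSubsetter(ttf);
                      for (int cp : usedCodePoints)
                      {
                          subsetter.add(cp);
                      }
                      ByteArrayOutputStream out = new ByteArrayOutputStream();
                      subsetter.writeToStream(out);
                      return out.toByteArray();
                  }
              }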

          Antti Lankila added a comment - edited

          I remain mildly confused about the subsetting. Why not just embed the entire font and render it as a CID-keyed font? I have a (misnamed) attachment on SourceForge with some functions that I hope the next jPod release will incorporate: http://sourceforge.net/p/jpodlib/patches/_discuss/thread/97a19659/a7dd/attachment/PDFBoxImprovements.java

          The Unicode support there works through the loadCIDFromTTF() method, which constructs the CID font and the Unicode CMap for copy-paste. Note that in jPod, fonts encode themselves through the mapping: the content stream generator calls the font's Encoding's encode(String) method to generate the byte sequences embedded in the document. This is the API that PDFBox must adopt, if it hasn't already. (PDFBox also wants a decode() method, I guess, but I did not provide one because it was not necessary for solving my immediate problem.)
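
          For a TrueType font behind Identity-H, such an encode(String) boils down to one cmap lookup per code point. A sketch, where FontBox's CmapSubtable#getGlyphId is an assumption:

              import java.io.ByteArrayOutputStream;
              import org.apache.fontbox.ttf.CmapSubtable;

              // Sketch: map each Unicode code point through the font's Unicode
              // cmap to a glyph ID and emit it as a 2-byte code (Identity-H).
              byte[] encode(String text, CmapSubtable unicodeCmap)
              {
                  ByteArrayOutputStream out = new ByteArrayOutputStream();
                  for (int i = 0; i < text.length(); )
                  {
                      int cp = text.codePointAt(i);
                      int gid = unicodeCmap.getGlyphId(cp);   // 0 means .notdef
                      out.write(gid >> 8);
                      out.write(gid & 0xFF);
                      i += Character.charCount(cp);
                  }
                  return out.toByteArray();
              }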

          Philip Helger added a comment -

          Thanks for clarifying things.
          Font subsetting has the practical advantage that the created PDF file is not as large.
          iText and Word also use subsetting.

          I'm eagerly awaiting the possibility to write Unicode text to PDF in a simple way.

          John Hewson added a comment -

          A typical TTF is around 300KB for a single style, so a PDF using regular/bold/italic would be 900KB. It's much worse for Asian fonts, which are 10-30MB per style. We already have a subsetter in FontBox, currently unused, so presumably all PDFBox needs to do is track which glyphs are used.

          Embedding the TTF as a CIDFont and building the ToUnicode CMap as you mention should be fairly simple. We do indeed need an #encode method on PDFont (or perhaps some builder class if we're doing subsetting). Decode is now provided in the form of PDFont#readCode, the result of which may be passed to PDFont#toUnicode.
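
          The decode direction then looks roughly like this, using the two methods just mentioned (sketch only):

              import java.io.ByteArrayInputStream;
              import java.io.IOException;
              import java.io.InputStream;
              import org.apache.pdfbox.pdmodel.font.PDFont;

              // Sketch: turn the raw bytes of a content-stream string into text.
              String decode(PDFont font, byte[] stringBytes) throws IOException
              {
                  InputStream in = new ByteArrayInputStream(stringBytes);
                  StringBuilder sb = new StringBuilder();
                  while (in.available() > 0)
                  {
                      int code = font.readCode(in);          // consumes 1..n bytes
                      String unicode = font.toUnicode(code); // null if unmapped
                      if (unicode != null)
                      {
                          sb.append(unicode);
                      }
                  }
                  return sb.toString();
              }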

          Andreas Lehmkühler added a comment -

          Embedding the TTF as a CIDFont and building the ToUnicode CMap as you mention should be fairly simple.

          Apache FOP uses CIDFonts; maybe we should have a look at their code.

          John Hewson added a comment -

          Good idea, this code looks relevant.
          ASF subversion and git services added a comment -

          Commit 1645068 from John Hewson in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1645068 ]

          PDFBOX-922: Encode content stream text using PDFont

          John Hewson added a comment -

          I've added an encode() method to PDFont, as discussed. This is now used when writing strings to the content stream, rather than encoding them as ISO-8859-1. I've implemented this method for PDTrueTypeFont, PDType1Font, and PDCIDFontType2. Note that PDTrueTypeFont still hardcodes WinAnsiEncoding.

          ASF subversion and git services added a comment -

          Commit 1645080 from John Hewson in branch 'pdfbox/trunk'
          [ https://svn.apache.org/r1645080 ]

          PDFBOX-922: Cleanly limit PDTrueTypeFont to WinAnsiEncoding

          John Hewson added a comment -

          We now have support for embedding Type0/CIDFontType2 fonts, due to PDFBOX-2524. This provides full Unicode support for embedding TTF fonts via PDType0Font. We still need to build a ToUnicode CMap though, for copy & paste.

          I've kept PDTrueTypeFont's limit of only supporting WinAnsiEncoding, but cleaned up the code to throw an exception if text outside of that range is encoded. Simple fonts were not designed for use with Unicode, so this is probably for the best.

          John Hewson added a comment -

          Building of ToUnicode CMaps was added in https://svn.apache.org/r1645083 as part of PDFBOX-2524. We now have full Unicode support for embedded TTFs using PDType0Font#load.
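
          A minimal end-to-end sketch against the trunk API (method names as of this writing; any Unicode-capable TTF will do):

              PDDocument doc = new PDDocument();
              PDPage page = new PDPage();
              doc.addPage(page);

              // load the TTF as a Type0/CIDFontType2 font, Identity-H encoded
              PDFont font = PDType0Font.load(doc, new File("arial.ttf"));

              PDPageContentStream cs = new PDPageContentStream(doc, page);
              cs.beginText();
              cs.setFont(font, 12);
              cs.newLineAtOffset(100, 700);
              cs.showText("Γειά σου, κόσμε");   // Greek: "Hello, world"
              cs.endText();
              cs.close();

              doc.save("unicode.pdf");
              doc.close();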

          John Hewson added a comment -

          I've opened a follow-up issue, PDFBOX-2565, for subsetting the embedded TTF font file.


            People

            • Assignee: Unassigned
            • Reporter: Thanos Agelatos
            • Votes: 16
            • Watchers: 23
