PDFBox / PDFBOX-922

TrueType PDFont subclass only supports WinAnsiEncoding (hardcoded!)

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.3.1
    • Fix Version/s: None
    • Component/s: Writing
    • Labels:
      None
    • Environment:
      JDK 1.6 / OS irrelevant, tried against 1.3.1 and 1.2.0

      Description

      PDFBox cannot embed TrueType fonts with Identity-H or Identity-V encoding in the PDFs it creates, which makes it impossible to create PDFs in any language other than English and those covered by WinAnsiEncoding. This happens because the method PDTrueTypeFont.loadTTF has WinAnsiEncoding hardcoded, and no Identity-H or Identity-V Encoding classes are provided (to set afterwards via PDFont.setFont()).

      This excludes the following languages plus many others:

      • Greek
      • Bulgarian
      • Swedish
      • Baltic languages
      • Maltese

      The PDF created contains garbled characters and/or squares.
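As a rough illustration of why such text cannot be written, the sketch below lists the characters of a string that a single-byte Western encoding cannot represent. It assumes that Java's windows-1252 charset is a reasonable stand-in for PDF's WinAnsiEncoding (the two are closely related but not guaranteed to be identical); the class and method names are made up for this example and are not part of PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class WinAnsiCheck {
    // Assumption: windows-1252 approximates PDF's WinAnsiEncoding.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns the characters of text that windows-1252 cannot encode, in order. */
    static String unencodable(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        StringBuilder missing = new StringBuilder();
        text.codePoints()
            .filter(cp -> !encoder.canEncode(new String(Character.toChars(cp))))
            .forEach(missing::appendCodePoint);
        return missing.toString();
    }

    public static void main(String[] args) {
        // The Greek word from the test case, written with Unicode escapes
        // so the source file compiles regardless of its encoding.
        String greek = "\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
        System.out.println(unencodable("Hello world from PDFBox " + greek));
    }
}
```

Running a check like this on the text before calling drawString() shows in advance which characters would come out garbled.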

      Simple test case:

      PDDocument doc = null;
      try {
          doc = new PDDocument();
          PDPage page = new PDPage();
          doc.addPage(page);
          // extract fonts for fields (extractFont is a helper that is not shown here)
          byte[] arialNorm = extractFont("arial.ttf");
          //byte[] arialBold = extractFont("arialbd.ttf");
          //PDFont font = PDType1Font.HELVETICA;
          PDFont font = PDTrueTypeFont.loadTTF(doc, new ByteArrayInputStream(arialNorm));
          PDPageContentStream contentStream = new PDPageContentStream(doc, page);
          contentStream.beginText();
          contentStream.setFont(font, 12);
          contentStream.moveTextPositionByAmount(100, 700);
          // text here may appear garbled; insert any text in Greek or Bulgarian or Maltese
          contentStream.drawString("Hello world from PDFBox ελληνικά");
          contentStream.endText();
          contentStream.close();
          doc.save("pdfbox.pdf");
          System.out.println(" created!");
      } catch (Exception ioe) {
          ioe.printStackTrace();
      } finally {
          if (doc != null) {
              try {
                  doc.close();
              } catch (Exception e) {}
          }
      }

          Activity

          Thanos Agelatos added a comment -

          I'm no PDF expert, but would the 'reverse' of the work done in PDFBOX-654 be sufficient to encode Identity-H in new PDFs?

          Andreas Lehmkühler added a comment -

          PDFBOX-654 is about text extraction. But you are correct, WinAnsiEncoding is hardcoded inside PDTrueTypeFont. For now, PDFBox has no support for Identity-H as an encoding when adding text.

          Thanos Agelatos added a comment -

          Andreas,
          thank you for the reply. I assumed that, since the code affected by PDFBOX-654 does some Identity-H parsing from the PDF, the opposite could achieve what is requested here. Anyway, do you have any plans for when this feature will become available? Being limited to WinAnsi rules out many of the languages that we want to generate PDFs for.

          thanks in advance
          Thanos

          Wolfgang Glas added a comment -

          I have implemented a glyph extractor, so that subfonts with at most 256 glyphs can be extracted from a large TTF font.

          The code can be found here and is licensed under the terms of the Apache license:

          http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/impl/TTFSubFont.java

          We use the code successfully to write PDFs with full-featured Unicode strings by splitting the TTF font into smaller subfonts.
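The splitting idea can be sketched as follows. This is not the TTFSubFont code itself, only a minimal illustration under simplifying assumptions: every distinct code point gets the next free slot, each subfont holds up to 256 slots, and details such as reserving a slot for the .notdef glyph are ignored. The class name SubFontPlanner is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SubFontPlanner {
    // A single-byte encoding can address at most 256 character codes.
    static final int MAX_GLYPHS = 256;

    /**
     * Assigns each distinct code point of text a slot, in first-seen order.
     * The value is {subfont index, single-byte code inside that subfont}.
     */
    static Map<Integer, int[]> plan(String text) {
        Map<Integer, int[]> slots = new LinkedHashMap<>();
        text.codePoints().forEach(cp -> {
            if (!slots.containsKey(cp)) {
                int n = slots.size();
                slots.put(cp, new int[] { n / MAX_GLYPHS, n % MAX_GLYPHS });
            }
        });
        return slots;
    }

    /** Number of subfonts needed to cover all planned code points. */
    static int subFontCount(Map<Integer, int[]> slots) {
        return (slots.size() + MAX_GLYPHS - 1) / MAX_GLYPHS;
    }
}
```

With a plan like this, each subfont can keep a simple single-byte encoding while the document as a whole covers more than 256 distinct characters.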

          Charlie B added a comment -

          Wow. Great timing .. found it hard to believe that this issue wasn't getting more traction. I recently discovered PDFBox and coded up a prototype for generating docs ...only to find this show stopper.

          Wolfgang could you give a bit more detail on how you use the extractor? Looks like in EntitiesPdfRenderer .. you have a getSubFont(), but I'm not quite sure how to apply the subfonts.

          Also, is there any plan at all to support full TTFs in PDFBox proper?

          Andreas Lehmkühler added a comment -

          @Wolfgang
          Sounds good to me, but I can't find any license information, either as a header or anywhere on the website. Can you add that information somehow, just for the record?

          Wolfgang Glas added a comment -

          Andreas, I've added the apache licensing terms to the TTFSubFont file.

          @Charlie: Inside

          http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/EntitiesPdfRenderer.java

          you will find an example of how to construct 256-glyph subfonts out of a large Unicode-capable font.

          http://svn.clazzes.org/svn/sketch/trunk/pdf/pdf-entities/src/main/java/org/clazzes/sketch/pdf/entities/impl/PdfRenderContextImpl.java

          gives you the code to draw a string composed out of multiple unicode blocks.
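One way to picture drawing such a string is to cut it into runs first, so that each run can be shown with a single subfont. The sketch below is not taken from PdfRenderContextImpl; it assumes, purely for illustration, that runs are delimited by Unicode block, and the class name BlockRuns is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BlockRuns {
    /** Splits text into maximal runs of consecutive characters from the same Unicode block. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeBlock currentBlock = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            // Close the current run when the block changes.
            if (current.length() > 0 && block != currentBlock) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentBlock = block;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

Each run would then be written with its own font selection followed by a show-text call for that run.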

          Please note that you have to set the /Length1 property of the embedded TTF font stream. (That is why I introduced the new PDTrueTypeFont.loadFont(PDStream, Encoding) method; note the PDStream argument.)

          Furthermore, I wrote my own accessor interface to Adobe's glyph list, because the API inside PDFBox's Encoding class is not optimal (Unicode code points are not represented as ints, and there is no static accessor to the parsed glyph list).
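To make the two wishes concrete (code points as ints, and a static accessor), here is a minimal sketch of such a lookup. It is not the accessor described above; it assumes the "name;hex" line format of Adobe's glyphlist.txt, uses only a handful of sample entries instead of the real file, and skips entries that map to several code points. The class name GlyphNames is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class GlyphNames {
    private static final Map<Integer, String> NAME_BY_CODE_POINT;

    static {
        // Sample lines in the "name;hex" format; the real glyph list is far larger.
        String[] sample = { "# sample entries", "A;0041", "alpha;03B1", "Euro;20AC" };
        Map<Integer, String> names = new HashMap<>();
        for (String line : sample) {
            if (line.startsWith("#")) {
                continue; // comment line
            }
            String[] parts = line.split(";");
            // Entries mapping one name to several code points are skipped in this sketch.
            if (parts.length == 2 && parts[1].indexOf(' ') < 0) {
                names.put(Integer.parseInt(parts[1], 16), parts[0]);
            }
        }
        NAME_BY_CODE_POINT = Collections.unmodifiableMap(names);
    }

    private GlyphNames() {
    }

    /** Returns the glyph name for a Unicode code point, or null if it is unknown. */
    static String nameOf(int codePoint) {
        return NAME_BY_CODE_POINT.get(codePoint);
    }
}
```

Keying the map by int code point means supplementary characters need no special handling, and the static accessor avoids re-parsing the list for every font.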

          And yes, I'd really like to see this integrated into PDFBox, but as I pointed out, it will need some fine-tuning, OpenType support, testing, etc.

          My class is geared towards Microsoft's core fonts and nothing more.

          Best regards, Wolfgang

          Charlie B added a comment -

          @Wolfgang

          Thanks for the pointers ... after spending some time with the code I think I get the patterns - very cool work ... are you willing to share the mods you've made to PDFBox (new loadFont method and Encoding changes)?

          Wolfgang Glas added a comment -

          Charlie,

          I forgot to mention that my work is based on the patch attached to PDFBOX-954.
          There you will find the improved loadFont() method and the corresponding Encoding changes.

          Andreas and I have arranged a meeting next week, where we will discuss how to integrate my patch into PDFBox. Furthermore, we will work out how to further improve the TTF Unicode support. We will, of course, report our findings to the mailing list and open follow-up JIRA issues as required.

          Wolfgang

          Charlie B added a comment -

          Hi Wolfgang, Andreas,

          Wondering how your meeting went. Before I start customizing for TTF-Unicode writing I'd like to know more about any plans for productizing.

          Thanks!

          Wolfgang Glas added a comment -

          Hi Charlie,

          Sorry for replying so late; I'm swamped with work here...

          Basically, Andreas and I agreed on introducing a Unicode-aware show-text API in PDFBox 2.0. I will announce plans on the mailing list and create issues accordingly when the dust on my desk settles.

          Best regards, Wolfgang

          Charlie B added a comment -

          Hi Wolfgang, Andreas,

          Again, I'm wondering if you have any solid plans for a Unicode text API in the near future?

          Thanks for any info or ETA on 2.0.

          Dinko Ivanov added a comment -

          Hello Andreas,

          We need to export Cyrillic content to PDF files. We've already invested significant effort in adopting PDFBox for our needs and would like to work around this problem somehow.
          Do you have any update on plans for including this feature in PDFBox?

          @Wolfgang: Could you share some more details/basic steps on how the solution in Sketch framework could be reused?
          I tried a simple scenario (export Drawing containing Cyrillic symbols to PDF), but without success. I think I'm missing something.

          Thanks and regards,
          Dinko

          Charlie B added a comment -

          Hi gang,

          Any update here? Even a very loose idea of timing would be very helpful.

          Thanks,
          Charlie
          Wolfgang Glas added a comment -

          Hi Charlie,

          I have bad news for you. We are struggling enormously to get our projects done this year, and I really do not have the capacity to dive any deeper into PDFBox. I can answer any kind of question if somebody steps up to implement wider Unicode support in PDFBox's writing API, but I cannot do the implementation and testing myself; sorry about that.

          I'd really love to come to ApacheCon in Sinsheim, but we do need to get our projects done. That's life.

          Best regards, Wolfgang

          Andreas Lehmkühler added a comment -

          As a first step, I added TTFSubFont support in revision 1413777, based on Wolfgang Glas's code.


            People

            • Assignee: Andreas Lehmkühler
            • Reporter: Thanos Agelatos
            • Votes: 17
            • Watchers: 16
