PDFBox / PDFBOX-283

Character encoding/appearance issues when filling forms

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: AcroForm
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1735902
      Originally submitted by scop on 2007-06-12 10:23.

      When filling a text field with non-ASCII characters such as in my surname "Skyttä" and saving the document in a UTF-8 environment, something goes wrong with the appearance of the text.

      The value itself seems to be stored correctly, but when opening the doc, "ä" is displayed as the two garbage characters you get when UTF-8 bytes are mistakenly decoded as ISO-8859-1.
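
The symptom can be reproduced outside PDFBox. The following is a minimal sketch, not PDFBox code: the class and method names are made up for illustration, and it only shows what happens when UTF-8 bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class MojibakeDemo {

    /** Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    static String misreadUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e4" is the letter a-umlaut; its UTF-8 form is the two bytes C3 A4.
        String garbled = misreadUtf8AsLatin1("Skytt\u00e4");
        // Each of the two bytes becomes its own character, so the string grows by one.
        System.out.println(garbled.length());
    }
}
```

The single character U+00E4 turns into U+00C3 followed by U+00A4, which matches the "two garbage characters" in the report.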

      PDAppearance uses the platform default encoding in quite a few places, which apparently has the potential to mess things up. In particular, insertGeneratedAppearance() creates a PrintWriter from an OutputStream without specifying the encoding. In fact, if I hack that to use ISO-8859-1, the appearance of my "ä" case is correct, but that obviously won't work for characters that are not valid ISO-8859-1.
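
To illustrate the difference an explicit encoding makes, here is a small sketch using only the JDK. The class and method names are hypothetical and this is not the code in insertGeneratedAppearance(); it just wraps the stream in an OutputStreamWriter with a named charset instead of relying on the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class ExplicitCharsetWriterDemo {

    /** Writes text through a PrintWriter whose charset is chosen explicitly. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // new PrintWriter(out) would silently use the platform default charset here.
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(text);
        writer.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4";
        System.out.println(write(umlaut, StandardCharsets.ISO_8859_1).length); // one byte: E4
        System.out.println(write(umlaut, StandardCharsets.UTF_8).length);      // two bytes: C3 A4
    }
}
```

With the charset pinned, the output no longer depends on the environment the JVM runs in, although ISO-8859-1 still cannot represent characters beyond U+00FF.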

      In which char encoding should the value be written to the appearance stream (at end of insertGeneratedAppearance())?

      1. PDAppearance.patch
        0.8 kB
        Maruan Sahyoun

        Activity

        Maruan Sahyoun added a comment -

        Tilman Hausherr Thx for applying the patch.

        We should keep the issue open because of the drawbacks mentioned above. In addition, the patch creates additional objects, which should be improved. A general resolution depends on improvements to PDFWriter handling Unicode encodings (Identity-H encoding).

        Tilman Hausherr added a comment (edited) -

        I committed Maruan's fix in rev 1587617 for the 1.8 branch and rev 1587616 for the trunk. (Sorry, forgot to add your name in the commit).

        Pasi Koski added a comment -

        I was able to fix non-ASCII characters being messed up by applying the patch described above. Thank you for the patch.

        I vote for applying the patch in the trunk.

        Also, it would be nice for new users to have a sample of AcroForm filling in the Cookbook. Any known issues could be described there as well (e.g. character encoding only supports single-byte character sets).

        Maruan Sahyoun added a comment -

        I added a quick fix for how a new field value is put into the appearance stream. The current implementation only works for single-byte character sets, and because the field's value and its string representation in the appearance stream are handled differently, the displayed text and the stored content can differ.

        There are some issues with calculating the appearance stream for fields that already had one, though, which should be addressed separately.

        Form filling now works with German umlauts as well as the characters presented above.

        Maruan Sahyoun added a comment -

        The patch ensures that non-ISO-8859-1 characters are embedded in the appearance stream as a hex string, similar to the field's value.
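
As a rough illustration of the hex string form, the sketch below is not the attached patch and not PDFBox's COSString. The class and method names are hypothetical, and for simplicity it writes every value as a hex string. It assumes the usual PDF text string convention: ISO-8859-1 bytes when every character fits in one byte, otherwise UTF-16BE preceded by the byte order mark FE FF.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class HexStringSketch {

    /** Returns the value as a PDF hex string such as <E4> or <FEFF20AC>. */
    static String toHexString(String value) {
        boolean fitsLatin1 = true;
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0xFF) {
                fitsLatin1 = false;
                break;
            }
        }
        // Assumption: one byte per character when possible, else BOM + UTF-16BE.
        byte[] bytes = fitsLatin1
                ? value.getBytes(StandardCharsets.ISO_8859_1)
                : ("\uFEFF" + value).getBytes(StandardCharsets.UTF_16BE);
        StringBuilder hex = new StringBuilder("<");
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString("\u00e4") + " Tj"); // a-umlaut
        System.out.println(toHexString("\u20ac") + " Tj"); // euro sign
    }
}
```

Note that a viewer interprets the bytes of a string shown with Tj through the current font's encoding, so the UTF-16BE form is only safe for the field value itself. That is consistent with the remarks in this thread that the quick fix only works for single-byte character sets.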

        Bernd Köster added a comment -

        I did some work on PDAppearance. You need to use the code above on every substring in the convertToMultiline method.

        Jukka Zitting added a comment -

        [Comment on SourceForge]
        Date: 2008-06-27 11:31
        Sender: nobody
        Logged In: NO

        I'm not sure PrintWriter is much of a problem. If I understand it right (probably not), the chars written through PrintWriter should be US-ASCII anyway.

        I ran into the same problem and made a patch which more or less fixes it. But my problem is with multiline text boxes. If the multiline flag is enabled, then PDAppearance.setAppearanceValue will call convertToMultiline, and this will replace newlines in the value with a little PDF code. I think this PDF code will get escaped and will show up in the rendered document with my changes.

        But then I wonder what happens, with or without my patch, if the field value contains ")", "\" or other chars that should be escaped. In fact, in some places PDAppearance seems to treat PDAppearance.value as a clear, unescaped value, and in others as already-escaped PDF code.

        I kept it unescaped and escaped it in insertGeneratedAppearance(). I was also thinking of just storing an escaped version in PDAppearance.value and, when multiline is on, escaping it line by line before applying convertToMultiline. But that would increase breakage in the fontSize and line length calculations, and I don't know how to fix that, because I'm not sure how much rendering calculation is wanted in PDFBox, and size calculations depend on rendering considerations. That is, I don't know which of the things that don't work are really meant to be fixed and which are a designed limitation to keep it manageable.
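
The "escape line by line" idea can be sketched as follows. This is an illustration under assumptions, not what convertToMultiline actually emits: the class, the method names, and the choice of a Td operator with a fixed leading to move to the next line are all hypothetical, and only the three characters that must be escaped in a PDF literal string are handled.

```java
// Hypothetical helper, not part of PDFBox.
final class MultilineEscapeSketch {

    /** Escapes "(", ")" and "\" so the text is safe inside a PDF literal string. */
    static String escapeLiteral(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Escapes each line separately, then joins the lines with text operators. */
    static String showLines(String value, int leading) {
        StringBuilder ops = new StringBuilder();
        String[] lines = value.split("\r\n|\r|\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                // Assumed layout: move down by the leading before each further line.
                ops.append("0 ").append(-leading).append(" Td\n");
            }
            ops.append('(').append(escapeLiteral(lines[i])).append(") Tj\n");
        }
        return ops.toString();
    }

    public static void main(String[] args) {
        System.out.print(showLines("first (line)\nsecond \\ line", 12));
    }
}
```

Because escaping happens before the operators are added, the operators themselves are never escaped. Width and font size calculations would still have to work on the unescaped text, which is the breakage the comment worries about.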

        The patch:

        --- PDFBox-0.7.3-orig/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java 2006-09-26 21:14:58.000000000 +0200
        +++ PDFBox-0.7.3/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java 2008-06-27 13:15:24.000000000 +0200
        @@ -408,7 +408,12 @@
                 {
                     throw new IOException( "Error: Unknown justification value:" + q );
                 }
        -        printWriter.println("(" + value + ") Tj");
        +        COSString val = new COSString(value);
        +        ByteArrayOutputStream valOutStream = new ByteArrayOutputStream();
        +        // writePDF only writes US-ASCII chars (if value has anything else, uses
        +        // hexadecimal representation, which is ascii)
        +        val.writePDF(valOutStream);
        +        printWriter.println(new String(valOutStream.toByteArray(), "US-ASCII") + " Tj");
                 printWriter.println("ET" );
                 printWriter.flush();
             }

          People

          • Assignee: Unassigned
          • Reporter: Anonymous
          • Votes: 2
          • Watchers: 5