PDFBox
  1. PDFBox
  2. PDFBOX-283

Character encoding/appearance issues when filling forms

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: AcroForm
    • Labels:
      None

      Description

      [imported from SourceForge]
      http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1735902
      Originally submitted by scop on 2007-06-12 10:23.

      When filling a text field with non-ASCII characters such as in my surname "Skyttä" and saving the document in a UTF-8 environment, something goes wrong with the appearance of the text.

      The value itself seems to be stored correctly, but when opening the doc, the appearance of "ä" is not that, but rather something which happens when UTF-8 is mistakenly treated as ISO-8859-1 (two garbage characters).

      PDAppearance uses the platform default encoding in quite a few places which apparently has potential to mess things up. In particular, insertGeneratedAppearance() generates a PrintWriter from an OutputStream without specifying the encoding. In fact, if I hack that to use ISO-8859-1, the appearance of my "ä" case is correct, but that won't obviously work with anything else than chars that are valid ISO-8859-1.

      In which char encoding should the value be written to the appearance stream (at end of insertGeneratedAppearance())?

      1. PDAppearance.patch
        0.8 kB
        Maruan Sahyoun
      2. PDAppearance.diff
        2 kB
        Marco Primiceri
      3. PDAppearance_bis.diff
        2 kB
        Marco Primiceri
      4. acroform.pdf
        464 kB
        Marco Primiceri

        Issue Links

          Activity

          Hide
          John Hewson added a comment -

          PDFBOX-922 is fixes all Unicode font encoding issues in content streams, so this issue is now resolved.

          Show
          John Hewson added a comment - PDFBOX-922 is fixes all Unicode font encoding issues in content streams, so this issue is now resolved.
          Hide
          Maruan Sahyoun added a comment -

          There are still issues with the appearance generation but this is now being dealt with in PDFBOX-2333. Linking that and keeping it open to ensure we use the information provided as a testbed.

          This also relates to PDFBOX-922 to resolve the encoding issue. So linking that too.

          Show
          Maruan Sahyoun added a comment - There are still issues with the appearance generation but this is now being dealt with in PDFBOX-2333 . Linking that and keeping it open to ensure we use the information provided as a testbed. This also relates to PDFBOX-922 to resolve the encoding issue. So linking that too.
          Hide
          Tilman Hausherr added a comment -

          Brandon Lyon tells that he still has the problem with 2.0 (but not in 1.8.7), see at the end of PDFBOX-283.

          I can't reproduce it, but I have a different effect: when calling setValue() twice, the field is empty in the saved PDF.

          Show
          Tilman Hausherr added a comment - Brandon Lyon tells that he still has the problem with 2.0 (but not in 1.8.7), see at the end of PDFBOX-283 . I can't reproduce it, but I have a different effect: when calling setValue() twice, the field is empty in the saved PDF.
          Hide
          Marco Primiceri added a comment -

          Hi Tilman Hausherr,
          Thanks a lot for your help.
          I will be working on populating AcroForms for a while so I will raise any issue that I can find.
          Cheers,
          Marco

          Show
          Marco Primiceri added a comment - Hi Tilman Hausherr , Thanks a lot for your help. I will be working on populating AcroForms for a while so I will raise any issue that I can find. Cheers, Marco
          Hide
          John Hewson added a comment -

          Regarding PDTextbox: the various PDField subclasses were renamed in the trunk to match the PDF spec (some names were quite misleading). The hierarchy of classes was also changed as it didn't match the spec in 1.8.

          Show
          John Hewson added a comment - Regarding PDTextbox: the various PDField subclasses were renamed in the trunk to match the PDF spec (some names were quite misleading). The hierarchy of classes was also changed as it didn't match the spec in 1.8.
          Hide
          Tilman Hausherr added a comment -

          Thanks for the test code and the test file. PDTextbox is gone in the trunk but is in the 1.8 version, that got me confused. Anyway, now that I had your code and pdf I understand what was happening, and that there is no test at this time. I committed your changes in 1609211 in the 1.8 version and rev 1609212 in the trunk. Please do visual tests on everything acroform-related as much as you can to be sure.

          Show
          Tilman Hausherr added a comment - Thanks for the test code and the test file. PDTextbox is gone in the trunk but is in the 1.8 version, that got me confused. Anyway, now that I had your code and pdf I understand what was happening, and that there is no test at this time. I committed your changes in 1609211 in the 1.8 version and rev 1609212 in the trunk. Please do visual tests on everything acroform-related as much as you can to be sure.
          Hide
          Marco Primiceri added a comment -

          Hi Tilman Hausherr

          Sorry, I've missed the pdf when uploading the other files.
          Using PDVariableText would be the same as PDTextbox is its subclass.

          Let me know if you need anything else.
          Cheers,
          Marco

          Show
          Marco Primiceri added a comment - Hi Tilman Hausherr Sorry, I've missed the pdf when uploading the other files. Using PDVariableText would be the same as PDTextbox is its subclass. Let me know if you need anything else. Cheers, Marco
          Hide
          Tilman Hausherr added a comment - - edited

          Could you also attach the PDF, please? (Or code that creates a PDF with that field)
          I assume you mean PDVariableText, not PDTextbox.

          Show
          Tilman Hausherr added a comment - - edited Could you also attach the PDF, please? (Or code that creates a PDF with that field) I assume you mean PDVariableText, not PDTextbox.
          Hide
          Marco Primiceri added a comment -

          Hi Tilman Hausherr,
          I have attached the diff files to this Jira issue.
          Here is some sample code that sets a textbox as multiline and populates it with a multiline string

          		String filePath =  System.getProperty("user.home") + "/Desktop/acroform.pdf";
          		String textboxFieldName = "multiline_textbox";
          		PDDocument pdfDocument = PDDocument.load(filePath);
          		PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
          		PDTextbox textbox = (PDTextbox) acroForm.getField(textboxFieldName);
          		textbox.setMultiline(true);
          		acroForm.getField(textboxFieldName).setValue("Some text on the first line\nthis is a new line");
          		pdfDocument.save(filePath);
          		pdfDocument.close();
          
          Show
          Marco Primiceri added a comment - Hi Tilman Hausherr , I have attached the diff files to this Jira issue. Here is some sample code that sets a textbox as multiline and populates it with a multiline string String filePath = System .getProperty( "user.home" ) + "/Desktop/acroform.pdf" ; String textboxFieldName = "multiline_textbox" ; PDDocument pdfDocument = PDDocument.load(filePath); PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm(); PDTextbox textbox = (PDTextbox) acroForm.getField(textboxFieldName); textbox.setMultiline( true ); acroForm.getField(textboxFieldName).setValue( "Some text on the first line\nthis is a new line" ); pdfDocument.save(filePath); pdfDocument.close();
          Show
          Marco Primiceri added a comment - Files attachment as per comment: https://issues.apache.org/jira/browse/PDFBOX-283?focusedCommentId=14054707&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054707
          Hide
          Maruan Sahyoun added a comment -

          Tilman Hausherr I’d go for the second diff because of the font size calculation, haven’t tested it though.

          For the longer term appearance generation needs an overhaul e.g. because we currently do not take Rich Text Strings (PDF 1.6) into account.

          Show
          Maruan Sahyoun added a comment - Tilman Hausherr I’d go for the second diff because of the font size calculation, haven’t tested it though. For the longer term appearance generation needs an overhaul e.g. because we currently do not take Rich Text Strings (PDF 1.6) into account.
          Hide
          Tilman Hausherr added a comment -

          Could you please 1) attach these diffs as files (click "More") so it stays for a longer time, 2) please provide a code example that uses that feature, so that I can see that it makes a difference.
          Maruan Sahyoun WDYT? I would rather take the first diff because it is closer to the current code, I don't know enough to be sure about the 2nd diff.

          Show
          Tilman Hausherr added a comment - Could you please 1) attach these diffs as files (click "More") so it stays for a longer time, 2) please provide a code example that uses that feature, so that I can see that it makes a difference. Maruan Sahyoun WDYT? I would rather take the first diff because it is closer to the current code, I don't know enough to be sure about the 2nd diff.
          Hide
          Marco Primiceri added a comment -

          Hi Tilman Hausherr

          Thanks a lot for your quick reply!
          I apologise for the inconvenience but I realised too late that the change I have suggested would not have any effect as any tag representing a newline would be converted to hex string hence would not be printed out.

              printWriter.println("<" + new COSString(value).getHexString() + "> Tj");
          

          where value contains the multi-line conversion mentioned in my previous post

              result.append(" > Tj\n0 -13 Td\n<");
          
          • In order to solve this issue the method insertGeneratedAppearance() would need to tokenize the value based on the new line tag applied during the multi line conversion and then print the hex string for each line.
            This solution is not ideal, as I am splitting the value on the newline tag, but I have done it anyway for completeness.
            Please see attached diff file PDAppearance.diff
          • In my opinion a better solution would be to apply the multi-line conversion only when printing the value in insertGeneratedAppearance() rather than storing the converted string.
            Basically, I noticed that the multi line conversion happens at the beginning of setAppearanceValue() but it does not really need to be there: the new line tag ("> Tj\n0 -13 Td\n<") is only relevant when printing out value in insertGeneratedAppearance() (also, it might actually have a negative impact on the font size calculation which is based on the value's length).
            I have written a second, cleaner, solution which does the multi-line conversion on the fly when printing out value: please see attached diff file PDAppearance_bis.diff

          I have tested a snapshot for both solutions and they both work fine.
          Please let me know if you need more details about this.

          Kind Regards,
          Marco

          Show
          Marco Primiceri added a comment - Hi Tilman Hausherr Thanks a lot for your quick reply! I apologise for the inconvenience but I realised too late that the change I have suggested would not have any effect as any tag representing a newline would be converted to hex string hence would not be printed out. printWriter.println( "<" + new COSString(value).getHexString() + "> Tj" ); where value contains the multi-line conversion mentioned in my previous post result.append( " > Tj\n0 -13 Td\n<" ); In order to solve this issue the method insertGeneratedAppearance() would need to tokenize the value based on the new line tag applied during the multi line conversion and then print the hex string for each line. This solution is not ideal, as I am splitting the value on the newline tag, but I have done it anyway for completeness. Please see attached diff file PDAppearance.diff In my opinion a better solution would be to apply the multi-line conversion only when printing the value in insertGeneratedAppearance() rather than storing the converted string. Basically, I noticed that the multi line conversion happens at the beginning of setAppearanceValue() but it does not really need to be there: the new line tag ("> Tj\n0 -13 Td\n<") is only relevant when printing out value in insertGeneratedAppearance() (also, it might actually have a negative impact on the font size calculation which is based on the value's length). I have written a second, cleaner, solution which does the multi-line conversion on the fly when printing out value : please see attached diff file PDAppearance_bis.diff I have tested a snapshot for both solutions and they both work fine. Please let me know if you need more details about this. Kind Regards, Marco
          Hide
          Tilman Hausherr added a comment -

          Done in rev 1608502 for the 1.8 branch and rev 1608506 for the trunk. Sorry for the regression. The code made sense, I looked at the specification too, so I committed it. Please test a snapshot version to be sure that all is OK now. If possible, please submit a "minimal" code example that fails (e.g. with an exception) with the bad code, and doesn't fail with the good code. I would then add it as a test.

          Show
          Tilman Hausherr added a comment - Done in rev 1608502 for the 1.8 branch and rev 1608506 for the trunk. Sorry for the regression. The code made sense, I looked at the specification too, so I committed it. Please test a snapshot version to be sure that all is OK now. If possible, please submit a "minimal" code example that fails (e.g. with an exception) with the bad code, and doesn't fail with the good code. I would then add it as a test.
          Hide
          Marco Primiceri added a comment -

          Hello Tilman Hausherr

          Maruans patch has solved one of my issues as well (thanks!) but it introduced a new bug when filling multi-line text boxes.
          Do you mind applying this fix to the multi line conversion method as well?

          --- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
          +++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java
          @@ -452,7 +452,7 @@
                   while( (currIdx = line.indexOf('\n',lastIdx )) > -1 )
                   {
                       result.append(line.substring(lastIdx,currIdx));
          -            result.append(" ) Tj\n0 -13 Td\n(");
          +            result.append(" > Tj\n0 -13 Td\n<");
                       lastIdx = currIdx + 1;
                   }
                   result.append(line.substring(lastIdx));
          

          Kind Regards,
          Marco

          Show
          Marco Primiceri added a comment - Hello Tilman Hausherr Maruans patch has solved one of my issues as well (thanks!) but it introduced a new bug when filling multi-line text boxes. Do you mind applying this fix to the multi line conversion method as well? --- a/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java +++ b/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/interactive/form/PDAppearance.java @@ -452,7 +452,7 @@ while ( (currIdx = line.indexOf('\n',lastIdx )) > -1 ) { result.append(line.substring(lastIdx,currIdx)); - result.append( " ) Tj\n0 -13 Td\n(" ); + result.append( " > Tj\n0 -13 Td\n<" ); lastIdx = currIdx + 1; } result.append(line.substring(lastIdx)); Kind Regards, Marco
          Hide
          Maruan Sahyoun added a comment -

          Tilman Hausherr Thx for applying the patch.

          We should keep the issue open because of the drawbacks mentioned above. In addition the patch creates additional objects which shall be improved. Having a general resolution depends on improvements to PDFWriter handling Unicode encodings (Identity-H encoding).

          Show
          Maruan Sahyoun added a comment - Tilman Hausherr Thx for applying the patch. We should keep the issue open because of the drawbacks mentioned above. In addition the patch creates additional objects which shall be improved. Having a general resolution depends on improvements to PDFWriter handling Unicode encodings (Identity-H encoding).
          Hide
          Tilman Hausherr added a comment - - edited

          I committed Maruans fix in rev 1587617 for the 1.8 branch and rev 1587616 for the trunk. (Sorry, forgot to add your name in the commit).

          Show
          Tilman Hausherr added a comment - - edited I committed Maruans fix in rev 1587617 for the 1.8 branch and rev 1587616 for the trunk. (Sorry, forgot to add your name in the commit).
          Hide
          Pasi Koski added a comment -

          I was able to fix non-ascii characters being messed up by applying the patch describe above. Thank you for the patch.

          I vote for applying the patch in the trunk.

          Also, it would be nice for new users to provide a sample of using AcroForm filling in the Cookbook. Any known issues could be described there as well (eg. character encoding only supports single byte character sets).

          Show
          Pasi Koski added a comment - I was able to fix non-ascii characters being messed up by applying the patch describe above. Thank you for the patch. I vote for applying the patch in the trunk. Also, it would be nice for new users to provide a sample of using AcroForm filling in the Cookbook. Any known issues could be described there as well (eg. character encoding only supports single byte character sets).
          Hide
          Maruan Sahyoun added a comment -

          I added a quick fix how a new field value is put into the appearance stream. The current implementation will only work for single byte character sets and as the fields value and the string representation of the value in the appearance stream are handled differently the display and the content are different.

          There are some issues with calculating the appearance stream for fields where there was already an existing one though, which should be addressed separately.

          The forms filling now works with german umlaut as well as the characters presented above.

          Show
          Maruan Sahyoun added a comment - I added a quick fix how a new field value is put into the appearance stream. The current implementation will only work for single byte character sets and as the fields value and the string representation of the value in the appearance stream are handled differently the display and the content are different. There are some issues with calculating the appearance stream for fields where there was already an existing one though, which should be addressed separately. The forms filling now works with german umlaut as well as the characters presented above.
          Hide
          Maruan Sahyoun added a comment -

          The patch ensures that non ISO-8859-1 characters are embedded in the appearance stream as a hex string similar to the fields value.

          Show
          Maruan Sahyoun added a comment - The patch ensures that non ISO-8859-1 characters are embedded in the appearance stream as a hex string similar to the fields value.
          Hide
          Bernd Köster added a comment -

          I did some work on the PDAppearace. You need to use the code above on every substring in the convertMulitline method.

          Show
          Bernd Köster added a comment - I did some work on the PDAppearace. You need to use the code above on every substring in the convertMulitline method.
          Hide
          Jukka Zitting added a comment -

          [Comment on SourceForge]
          Date: 2008-06-27 11:31
          Sender: nobody
          Logged In: NO

          I'm not sure PrintWriter is a lot of problem, If I understand it right
          (probably not),
          the chars written through PrintWriter should be US-ASCII anyway.

          I ran into the same problem and made a patch which more or less fixes it.
          But my problem is with
          multiline text boxes. If the multiline flag is enabled, then
          PDAppearance.setAppearanceValue
          will call convertToMultiline, and this will replace newlines in the value
          with a little
          PDF code. I think this PDF code will get escaped and will show in the
          rendered document with
          my changes.

          But then I wonder what happens, with or without my patch, if the field
          value contains ")","\" or other chars
          that should be escaped. In fact in some places in PDAppearance it seems it
          considers PDAppearance.value
          to be a clear unescaped value and in others it seems it should be escaped
          PDF code.

          I kept it unescaped, and escaped it in insertGeneratedAppearance(), but
          I was thinking of just storing in PDAppearance.value an escaped version,
          and in case of multiline being on then
          escape it line by line before applying convertToMultiline, but that would
          increase breakage
          in the fontSize and line length calculations, and I don't know how to fix
          that, because I'm not
          sure how much rendering calculations are wanted in PDFBox, and size
          calculations depend on
          rendering considerations. That is, I don't know which of the things that
          don't work are
          really meant to be fixed or are a designed limitation to keep it
          manageable.

          The patch :


          PDFBox-0.7.3-orig/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java
          2006-09-26 21:14:58.000000000 +0200
          +++ PDFBox-0.7.3/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java
          2008-06-27 13:15:24.000000000 +0200
          @@ -408,7 +408,12 @@

          { throw new IOException( "Error: Unknown justification value:" + q ); }
          • printWriter.println("(" + value + ") Tj");
            + COSString val = new COSString(value);
            + ByteArrayOutputStream valOutStream = new
            ByteArrayOutputStream();
            + // writePDF only writes US-ASCII chars (if value has
            anything else, uses
            + // hexadecimal representation, which is ascii)
            + val.writePDF(valOutStream);
            + printWriter.println(new String(valOutStream.toByteArray(),
            "US-ASCII") + " Tj");
            printWriter.println("ET" );
            printWriter.flush();
            }
          Show
          Jukka Zitting added a comment - [Comment on SourceForge] Date: 2008-06-27 11:31 Sender: nobody Logged In: NO I'm not sure PrintWriter is a lot of problem, If I understand it right (probably not), the chars written through PrintWriter should be US-ASCII anyway. I ran into the same problem and made a patch which more or less fixes it. But my problem is with multiline text boxes. If the multiline flag is enabled, then PDAppearance.setAppearanceValue will call convertToMultiline, and this will replace newlines in the value with a little PDF code. I think this PDF code will get escaped and will show in the rendered document with my changes. But then I wonder what happens, with or without my patch, if the field value contains ")","\" or other chars that should be escaped. In fact in some places in PDAppearance it seems it considers PDAppearance.value to be a clear unescaped value and in others it seems it should be escaped PDF code. I kept it unescaped, and escaped it in insertGeneratedAppearance(), but I was thinking of just storing in PDAppearance.value an escaped version, and in case of multiline being on then escape it line by line before applying convertToMultiline, but that would increase breakage in the fontSize and line length calculations, and I don't know how to fix that, because I'm not sure how much rendering calculations are wanted in PDFBox, and size calculations depend on rendering considerations. That is, I don't know which of the things that don't work are really meant to be fixed or are a designed limitation to keep it manageable. The patch : — PDFBox-0.7.3-orig/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java 2006-09-26 21:14:58.000000000 +0200 +++ PDFBox-0.7.3/src/org/pdfbox/pdmodel/interactive/form/PDAppearance.java 2008-06-27 13:15:24.000000000 +0200 @@ -408,7 +408,12 @@ { throw new IOException( "Error: Unknown justification value:" + q ); } printWriter.println("(" + value + ") Tj"); + COSString val = new COSString(value); + ByteArrayOutputStream valOutStream = new ByteArrayOutputStream(); + // writePDF only writes US-ASCII chars (if value has anything else, uses + // hexadecimal representation, which is ascii) + val.writePDF(valOutStream); + printWriter.println(new String(valOutStream.toByteArray(), "US-ASCII") + " Tj"); printWriter.println("ET" ); printWriter.flush(); }

            People

            • Assignee:
              Unassigned
              Reporter:
              Anonymous
            • Votes:
              3 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development