PDFBox / PDFBOX-1279

Preflight reports "1.1 : Body Syntax error"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.7.0
    • Component/s: Preflight
    • Labels:
      None
    • Environment:
      Win 7 64Bit, jre 1.6.31

      Description

      I just tried the PDF/A validation. It fails on the attached PDF with "1.1 : Body Syntax error". Adobe Preflight reports success for both PDF/A level A and PDF/A level B validation. The PDF was created with plain LibreOffice 3.5.2 (export as PDF, using PDF/A level A).

      1. input_pdf_a_lvl_a_libreoffice_352.pdf
        22 kB
        beat weisskopf
      2. pdfbox_1279_cs.patch
        2 kB
        Guillaume Bailleul

        Activity

        Guillaume Bailleul added a comment -

        The build failed because String.getBytes(Charset) does not exist in Java 5.
        The problem is fixed in r1336380.
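
        For illustration only (this is not the actual change made in r1336380): a minimal sketch of a Java 5-compatible way to get bytes in a fixed charset. String.getBytes(String charsetName) predates the Charset overload but declares a checked exception. The class and helper names below are hypothetical.

```java
import java.io.UnsupportedEncodingException;

public class Latin1Bytes {
    // Hypothetical helper. String.getBytes(String) is available on Java 5,
    // whereas String.getBytes(Charset) was only added in Java 6.
    static byte[] toLatin1(String s) {
        try {
            return s.getBytes("ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is one of the charsets every Java platform must support,
            // so this branch should never be reached in practice.
            throw new IllegalStateException("ISO-8859-1 not supported", e);
        }
    }

    public static void main(String[] args) {
        byte[] header = toLatin1("%PDF-1.4");
        System.out.println(header.length + " bytes");
    }
}
```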

        Guillaume Bailleul added a comment -

        Resolved in revision r1336366.

        Guillaume Bailleul added a comment -

        @Eric

        I understand that a PDF can contain any 8-bit byte value.
        ISO-8859-1 defines a character for every value; this is not the case for Cp1252 (0x81, 0x8D, 0x8F, 0x90 and 0x9D are unused).

        So I applied this patch:

        • the InputStreamParser used by JavaCC is initialized with an explicit charset (ISO-8859-1)
        • in the grammar, the charset is always specified in getBytes

        Unrelated to this issue: I also removed project.build.sourceEncoding in preflight, which was overriding the pdfbox one for no (good) reason.
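
        The difference between the two charsets can be checked with a small sketch like the one below. It decodes each byte value and encodes it again, then lists the values that do not survive. The class and method names are made up for this example, and "windows-1252" is assumed to be available in the JDK under that name. The list for ISO-8859-1 should be empty; what is reported for windows-1252 depends on how the JDK treats the unused byte values.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharsetRoundTrip {
    // Returns the byte values (0..255) that change when decoded to a String
    // and encoded back with the same charset.
    static List<Integer> lossyBytes(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        List<Integer> lossy = new ArrayList<Integer>();
        for (int b = 0; b < 256; b++) {
            byte[] in = { (byte) b };
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(b);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        // Values are printed in decimal.
        System.out.println("ISO-8859-1:   " + lossyBytes("ISO-8859-1"));
        System.out.println("windows-1252: " + lossyBytes("windows-1252"));
    }
}
```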

        William Fausser added a comment -

        Hi,

        Thanks to all. I'm still very much a novice at exploring your code.

        The encoding fix will probably clear up problems I experienced in earlier tests (last year), which I blamed on "strict" parsing rules in JavaCC. In those test sets I often got the dreaded "1.1 : Body Syntax error" on the second or third line of the PDF.

        Best regards,
        Bill

        Eric Leleu added a comment -

        Hi,

        In the PDF Reference, we can read :

        "... PDF can be entirely represented using byte values corresponding to the visible printable subset of the ASCII character set, plus white space characters such as space, tab, carriage return, and line feed characters. ASCII is the American Standard Code for Information Interchange, a widely used convention for encoding a specific set of 128 characters as binary numbers. However, a PDF file is not restricted to the ASCII character set; it can contain arbitrary 8-bit bytes,..."

        So there is no recommended charset. However, instead of UTF-8, the default should be US-ASCII or ISO-8859-1.

        The problem comes from the comment line just after the header line, which contains at least 4 binary characters (code >= 128). As far as I remember, to match binary characters in JavaCC we must describe them using Unicode notation (\uxxxx). With the charset Cp1252, the byte <9F> cannot match the token BINARY ([\u0080-\u00FF]), because it is mapped to the Unicode character \u0178. (See http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT)

        So we have 3 possibilities:

        [1] - Find a way to specify binary characters without Unicode notation in JavaCC

        [2] - Add all the Unicode exceptions for Cp1252 to the BINARY token description

        [3] - Update the BINARY token to [\u0080-\uFFFF] to avoid other charset-specific mappings.

        I prefer the first one, but if we can't do it, the third one may be the best way to avoid further issues.

        With the third option, I ran my whole test set successfully under the following encodings:

        • US-ASCII
        • Cp1252
        • ISO-8859-1
        • utf8
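
        A rough way to see why the wider range copes with all four encodings is sketched below. It is a simplification, not the JavaCC grammar itself: each byte from 0x80 to 0xFF is decoded on its own and compared with the two candidate ranges, whereas a real stream decoder sees byte sequences. The class and method names are invented for the example, and "windows-1252" is used as the Java name for Cp1252.

```java
import java.nio.charset.Charset;

public class BinaryTokenRange {
    // True if every byte 0x80..0xFF, decoded one at a time with the given
    // charset, yields exactly one char inside [0x0080, upperBound].
    static boolean highBytesMatch(String charsetName, int upperBound) {
        Charset cs = Charset.forName(charsetName);
        for (int b = 0x80; b <= 0xFF; b++) {
            String decoded = new String(new byte[] { (byte) b }, cs);
            if (decoded.length() != 1) {
                return false;
            }
            int c = decoded.charAt(0);
            if (c < 0x80 || c > upperBound) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] charsets = { "US-ASCII", "windows-1252", "ISO-8859-1", "UTF-8" };
        for (String name : charsets) {
            System.out.println(name
                + "  [0080-00FF]=" + highBytesMatch(name, 0x00FF)
                + "  [0080-FFFF]=" + highBytesMatch(name, 0xFFFF));
        }
    }
}
```

        Only ISO-8859-1 keeps every high byte inside the narrow range; under the other charsets at least one byte decodes to a character above \u00FF (a mapped character such as \u0178, or a replacement character).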

        BR,
        Eric

        Guillaume Bailleul added a comment -

        Proposed fix

        Guillaume Bailleul added a comment -

        The problem is in the initialization of the parser.
        No charset is provided when creating the JavaCC SimpleCharStream, so the platform default is used.

        I propose this patch to fix it. Quite simple in the end. It works with UTF-8 or ISO-8859-1...

        I am not sure this is the right approach: where do we find the correct encoding? Does it matter?

        @Eric: I really need your opinion on that point.
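
        The effect can be reproduced without JavaCC. The sketch below uses a plain InputStreamReader as a stand-in for the character stream the parser reads from (an assumption made for this example; the attached patch is not reproduced here, and the class and method names are invented). It shows that the same byte 0x9F arrives as a different character depending on the charset passed in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class ExplicitCharsetReader {
    // Decodes the data with an explicitly named charset and returns the
    // first character as an int (or -1 if the data is empty).
    static int firstChar(byte[] data, String charsetName) {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), Charset.forName(charsetName));
        try {
            try {
                return reader.read();
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0x9F };
        System.out.println("ISO-8859-1:   U+"
                + Integer.toHexString(firstChar(data, "ISO-8859-1")).toUpperCase());
        System.out.println("windows-1252: U+"
                + Integer.toHexString(firstChar(data, "windows-1252")).toUpperCase());
    }
}
```

        With ISO-8859-1 the character is U+9F; with windows-1252 it is U+178, the value reported in the lexical error for this issue.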

        Guillaume Bailleul added a comment -

        After adding a printStackTrace in HeaderParseException.getErrorCode, we get (Linux with file encoding cp1252):

        org.apache.padaf.preflight.HeaderParseException: Lexical error at line 2, column 9. Encountered: "\u0178" (376), after : ""
        at org.apache.padaf.preflight.javacc.PDFParser.PDF_header(PDFParser.java:591)
        at org.apache.padaf.preflight.javacc.PDFParser.PDF(PDFParser.java:837)
        at org.apache.padaf.preflight.PdfA1bValidator.validate(PdfA1bValidator.java:61)
        at org.apache.padaf.preflight.Validator_A1b.main(Validator_A1b.java:51)

        There are also some getBytes() calls with no charset argument in the JavaCC-generated sources:

        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: literalLength += currentToken.image.getBytes().length;
        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: if (previous != null && previous.image.getBytes()[previous.image.getBytes().length-1]!='\u005c\u005c') {
        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: if (previous != null && previous.image.getBytes()[previous.image.getBytes().length-1]!='\u005c\u005c') {
        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: if (previous != null && previous.image.getBytes()[previous.image.getBytes().length-1]!='\u005c\u005c') {
        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: if (token != null && token.image.getBytes().length > MAX_NAME_SIZE) {
        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: throw new PdfParseException("Object Name is toot long : " + token.image.getBytes().length, ERROR_SYNTAX_NAME_TOO_LONG);
        target/generated-sources/javacc/org/apache/padaf/preflight/javacc/PDFParser.java: if (token != null && token.image.getBytes().length < 4) {
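
        Why these calls matter: the byte length of a token image depends on the charset used by getBytes(), so length checks such as the MAX_NAME_SIZE test can give different results on different platforms. A minimal sketch with a made-up token image (the class and method names are also invented):

```java
import java.nio.charset.Charset;

public class TokenByteLength {
    // Byte length of a token image under an explicitly named charset.
    static int byteLength(String image, String charsetName) {
        return image.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String image = "Caf\u00E9"; // four characters, the last one non-ASCII
        System.out.println("ISO-8859-1: " + byteLength(image, "ISO-8859-1")); // 4
        System.out.println("UTF-8:      " + byteLength(image, "UTF-8"));      // 5
    }
}
```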

        beat weisskopf added a comment -

        Hi Bill

        I tried it on my Linux box, and it works there. I also got it to work on my Windows box by specifying the encoding:

        C:\tmp>java -jar -Dfile.encoding=iso-8859-1 preflight-1.7.0-20120410.222958-102-jar-with-dependencies.jar input_pdf_a_lvl_a_libreoffice_352.pdf
        log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.PDFStreamEngine).
        log4j:WARN Please initialize the log4j system properly.
        The file input_pdf_a_lvl_a_libreoffice_352.pdf is a valid PDF/A-1b file

        C:\tmp>java -jar -Dfile.encoding=utf-8 preflight-1.7.0-20120410.222958-102-jar-with-dependencies.jar input_pdf_a_lvl_a_libreoffice_352.pdf
        log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.PDFStreamEngine).
        log4j:WARN Please initialize the log4j system properly.
        The file input_pdf_a_lvl_a_libreoffice_352.pdf is a valid PDF/A-1b file

        C:\tmp>java -jar -Dfile.encoding=cp1252 preflight-1.7.0-20120410.222958-102-jar-with-dependencies.jar input_pdf_a_lvl_a_libreoffice_352.pdf
        log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.PDFStreamEngine).
        log4j:WARN Please initialize the log4j system properly.
        The fileinput_pdf_a_lvl_a_libreoffice_352.pdf is not valid, error(s) :
        1.1 : Body Syntax error

        I had a quick search through the code: there are possible issues in StreamValidationHelper and TrailerValidationHelper (new String(..) without an encoding, getBytes(..) without an encoding). But I am not sure what to specify there...
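
        To check which default a given JVM falls back on when no encoding is specified, a small diagnostic along these lines can be run with and without -Dfile.encoding (the class and method names are made up for this example):

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    // Reports the file.encoding system property and the charset the JVM uses
    // for new String(byte[]) and String.getBytes() without arguments.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```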

        thanks, beat

        William Fausser added a comment -

        Hi Beat,

        One difference for me is that I'm using a full path to the jar file. Here is a run as close to yours as I can get, but using a full path:

        [fausser@sally gba-awl-padaf-354034e]$ java -jar /home/fausser/gba-awl-padaf-354034e/preflight/target/preflight-1.7.0-20120410.222958-102-jar-with-dependencies.jar /home/fausser/input_pdf_a_lvl_a_libreoffice_352.pdf
        log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.PDFStreamEngine).
        log4j:WARN Please initialize the log4j system properly.
        The file /home/fausser/input_pdf_a_lvl_a_libreoffice_352.pdf is a valid PDF/A-1b file

        Using just the jar name:
        java -jar preflight-1.7.0-20120410.222958-102-jar-with-dependencies.jar /home/fausser/input_pdf_a_lvl_a_libreoffice_352.pdf
        log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.PDFStreamEngine).
        log4j:WARN Please initialize the log4j system properly.
        The file /home/fausser/input_pdf_a_lvl_a_libreoffice_352.pdf is a valid PDF/A-1b file

        I'm running on a Linux box.

        beat weisskopf added a comment -

        Hi Bill, thanks for looking at it. Unfortunately, I could not get the build you used, as I could not go back far enough in the Jenkins history. When I created the bug entry, I had built the latest from svn. This time I used the build from Jenkins (via the link you mentioned). I still get the same error:

        C:\tmp>java -jar preflight-1.7.0-20120410.222958-102-jar-with-dependencies.jar input_pdf_a_lvl_a_libreoffice_352.pdf
        log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.PDFStreamEngine).
        log4j:WARN Please initialize the log4j system properly.
        The fileinput_pdf_a_lvl_a_libreoffice_352.pdf is not valid, error(s) :
        1.1 : Body Syntax error

        Either it is a bug newly introduced since March 30th, or something like a platform encoding issue? I am running on Win 7 with a Swiss-German locale.

        Thanks a lot, beat

        William Fausser added a comment -

        Hi,
        Using the latest build with input_pdf_a_lvl_a_libreoffice_352.pdf, I get a valid PDF/A result from preflight.

        output:
        /home/fausser/input_pdf_a_lvl_a_libreoffice_352.pdf is a valid PDF/A-1b file
        https://builds.apache.org/job/PDFBox-trunk/lastBuild/org.apache.pdfbox$preflight/

        preflight-1.7.0-20120330.162413-95-jar-with-dependencies.jar

        BR,
        Bill


          People

          • Assignee:
            Guillaume Bailleul
            Reporter:
            beat weisskopf