PDFBOX-940

[pdmodel.font.PDFont] Error: Could not parse predefined CMAP file for 'PDFXC-Indentity0-0'

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.8.4, 2.0.0
    • Component/s: None
    • Labels:
      None
    • Environment:
      Tomcat 6.0.18, windows server 2003, pdfbox-1.4.0.jar

      Description

      Hi,

      When I try to upload a PDF document, the following error is thrown in Tomcat. I am using pdfbox-1.4.0.jar.

      17:29:33,465 ERROR [pdmodel.font.PDFont] Error: Could not parse predefined CMAP file for 'PDFXC-Indentity0-0'

      Please help me find a solution.

      1. pdf properties3.JPG
        41 kB
        krishna
      2. pdf properties2.JPG
        47 kB
        krishna
      3. pdf properties1.JPG
        33 kB
        krishna
      4. pdf fonts2.JPG
        48 kB
        krishna
      5. pdf fonts1.JPG
        49 kB
        krishna
      6. pdf fonts.JPG
        43 kB
        krishna
      7. oob_pdf.pdf
        354 kB
        Henrique Nunes
      8. noabank_agb.pdf
        778 kB
        Tilman Hausherr
      9. gen_preview1.png
        13 kB
        Henrique Nunes

        Activity

        Lars Torunski added a comment -

        Similar problem: 2011-03-02 08:30:24,126 [PWS-Index-Thread-569] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-0'

        Both PDFXC-Indentity0-0 and Adobe-WinCharSetFFFF-0 aren't available in org/apache/pdfbox/resources/cmap/

        Andreas Lehmkühler added a comment -

        Please update to the newest version of PDFBox and try again. We added some text extraction improvements, including some fixes for the handling of CMaps. If it still doesn't work, please attach a sample PDF.

        Gabriel Gravel added a comment -

        I have a similar problem. I used to have the following error when extracting text from a batch of PDF files using pdfbox 1.3.1:
        ERROR 10 Mar 2011 00:22:44.038 [org.apache.pdfbox.pdmodel.font.PDFont] line:285 - Error: Could not parse predefined CMAP file for 'Adobe-UCS-0'
        After reading the comments here, I have upgraded to 1.5.0 and am now having the following error:
        ERROR 15 Mar 2011 14:31:10.195 [org.apache.pdfbox.pdmodel.font.PDCIDFont] line:324 - Error: Could not parse predefined CMAP file for 'Adobe-UCS-UCS2'

        However, I still seem to be able to extract the text correctly from the file. Should I be worried about this error or can I ignore it altogether? Here's a link to one of the problematic files: http://cbpp-pcpe.phac-aspc.gc.ca/intervention_pdf/en/72.pdf

        Thanks for your time

        Andreas Lehmkühler added a comment -

        I solved the issue with Gabriel's PDF in revision 1085755.

        @Lars, krishna
        Do you still have the described problem using the current trunk version? If the problem still persists, can you provide us with a sample pdf?

        Lars Torunski added a comment -

        Currently I can test the 1.5.0 version only.

        krishna added a comment -

        Hi Andreas,

        For security reasons, I can't upload the document.

        The error "Could not parse predefined CMAP file for 'PDFXC-Indentity0-0'" was resolved in version 1.5.0, but "Could not parse predefined CMAP file for 'Adobe-UCS-UCS2'" still occurs there.

        Please check this error.

        Thanks,
        Murali

        Lars Torunski added a comment -

        With 1.5.0 the error

        2011-03-28 10:38:20,207 [PWS-Index-Thread-35] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-0'

        doesn't occur anymore. But we are getting

        2011-03-28 11:52:51,162 [PWS-Index-Thread-44] ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'

        with the same pdf file now.

        I'm not allowed to attach the pdf here, but I can send you the pdf by email.

        Henrique Nunes added a comment - - edited

        Hi. I'm having the same problem:

        30/Mar/2011 17:15:10 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        SEVERE: Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'

        I'm using pdfbox-app-1.5.0.jar with Jython 2.5.2 on Windows 7 64bit

        No problems when on Ubuntu 10.

        UPDATE: I built pdfbox-app-1.6.0-SNAPSHOT from the latest sources and the issue persists.

        Henrique Nunes added a comment -

        These are the files relevant for my comment below.

        Che-wei Kuo added a comment - - edited

        Dear all,

        @Andreas Lehmkühler
        Thank you for dealing with the original issue.
        However, there are still some errors after I built revision 1088324.
        The error message is now:

        "org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        Error: Could not parse predefined CMAP file for 'Adobe-Identity-UCS' "

        I'm not sure if it was the same issue.
        Thanks.

        Best Regards

        Che-wei Kuo added a comment -

        Sorry, I got the wrong error message.

        It should be this one:

        "org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        Error: Could not parse predefined CMAP file for 'Adobe-Identity-UCS' "

        krishna added a comment -

        Hi

        'PDFXC-Indentity0-0' was fixed in 1.5.0, but the 'Adobe-WinCharSetFFFF-UCS2' error is still present in 1.5.0.

        Joscha Feth added a comment -

        font.PDFont: Error: Could not parse predefined CMAP file for 'Adobe-UCS-0'

        still appearing in 1.5.0

        Juhasz Istvan added a comment -

        SEVERE: Error: Could not parse predefined CMAP file for 'Adobe-Identity-UCS'
        revision 1139575 (1.6.0-SNAPSHOT)
        (pdf - embedded truetype (cid) font with encoding identity-h)

        pdf>java -jar pdfbox-app-1.6.0-SNAPSHOT.jar ExtractText -debug a015.pdf a015.txt
        Loading PDF a015.pdf
        Time for loading: 0.062 seconds
        Starting text extraction
        2011.06.26. 18:38:39 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        INFO: cidSystemInfo: Adobe-UCS-0
        2011.06.26. 18:38:39 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        INFO: resourceName: org/apache/pdfbox/resources/cmap/Adobe-Identity-UCS
        2011.06.26. 18:38:39 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        SEVERE: Error: Could not parse predefined CMAP file for 'Adobe-Identity-UCS'
        2011.06.26. 18:38:39 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        INFO: cidSystemInfo: Adobe-UCS-0
        2011.06.26. 18:38:39 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        INFO: resourceName: org/apache/pdfbox/resources/cmap/Adobe-Identity-UCS
        2011.06.26. 18:38:39 org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
        SEVERE: Error: Could not parse predefined CMAP file for 'Adobe-Identity-UCS'
        Time for extraction: 0.984 seconds

        Lars Torunski added a comment -

        Can we close this issue for PDFXC-Indentity0-0 and Adobe-WinCharSetFFFF-0?

        And create a new one for "UCS" with Adobe-Identity-UCS, Adobe-WinCharSetFFFF-UCS2, Adobe-UCS-0, Adobe-UCS-UCS2 etc.?

        Kevin Clark added a comment -

        I'm getting this via the Tika 0.10 release which uses 1.6.0.

        2011-10-01 16:48:27,586 (55308987) [Parser-thread-2] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-0'

        Can't upload the pdf for privacy reasons, unfortunately.

        Antoni Mylka added a comment - - edited

        I stumbled upon the same problem, on a confidential file. In the process I think I found an issue: PDFBOX-1137.

        I'm not a PDF expert, but in that file, I have the following PDF objects:

        24 0 obj
        <</Type/Font/Subtype/Type0/BaseFont/TT491A9C96tCID/Encoding 18 0 R/DescendantFonts[22 0 R]>>
        endobj

        22 0 obj
        <</Subtype/CIDFontType2/CIDSystemInfo 23 0 R/BaseFont/XJXBKC+TT491A9C96tCID/Type/Font/Name/R22/FontDescriptor 21 0 R/DW 1000
        /W[691[259]
        724[677
        626
        626]
        737[677]]/CIDToGIDMap/Identity
        >>
        endobj

        18 0 obj
        <</Type/CMap/Name/R18/WMode 0/CMapName/WinCharSetFFFF-H/CIDSystemInfo<<
        /Registry(Adobe)
        /Ordering(WinCharSetFFFF)
        /Supplement 0
        >>
        /Filter/FlateDecode/Length 19 0 R>>stream
        (the binary content of the stream omitted for readability)
        endstream
        endobj

        So there is an embedded CMAP for WinCharSetFFFF-H, a parent font which refers to the embedded CMAP as its encoding, and a child font with no encoding. Applying the PDFBOX-1137 patch allowed the CMAP to be parsed.

        Then, in the PDType0Font constructor, I added an if statement just after the descendant font is constructed, which makes it "inherit" the cmap from the parent font. This fixed NPEs during text extraction, which happened because the cmap was missing:

        descendentFont = PDFontFactory.createFont( descendantFontDictionary );
        if (descendentFont.cmap == null)
        {
            descendentFont.cmap = this.cmap;
        }

        I don't even know if this makes sense. Is the descendant font supposed to "inherit" the encoding from the parent font? This "fixed" the visible errors, but the output I get is still garbled. It's supposed to be a text in traditional Chinese. Can anyone with more PDF knowledge take a look at this?

        Arjohn Kampman added a comment -

        I'm also seeing "Could not parse predefined CMAP file for 'Adobe-Identity-UCS'" error message on some files. Debugging this error with the current trunk (r1184806), I noticed that PDCIDFont.determineEncoding() starts with a cidSystemInfo value "Adobe-UCS-0" and replaces this with "Adobe-Identity-UCS" in the else-if-statement. This triggers the error message because there is no such cmap file.

        However, considering that the first if-statement maps any cidSystemInfo values containing "Identity" to "Identity-H", I'm wondering: should "Adobe-UCS-0" be mapped to "Identity-H" rather than "Adobe-Identity-UCS"?
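
The name mapping discussed in this thread can be sketched roughly as follows. This is a reconstruction from the comments and log lines above, not the actual PDFBox source. The class name CMapNameSketch, the method resolve, the exact else-if condition (startsWith("Adobe-UCS-")) and the final branch that swaps the supplement for "-UCS2" are all assumptions; the last one is inferred from log entries such as 'Adobe-WinCharSetFFFF-UCS2'.

```java
// Sketch only: reconstructs the name mapping described in this thread.
// It is NOT the PDFBox source; every name and condition here is an assumption.
final class CMapNameSketch {
    private CMapNameSketch() {}

    /**
     * Maps a CIDSystemInfo string of the form "Registry-Ordering-Supplement"
     * to the CMap resource name that would be looked up.
     */
    static String resolve(String cidSystemInfo) {
        if (cidSystemInfo.contains("Identity")) {
            // Described in the thread: anything containing "Identity" becomes "Identity-H".
            return "Identity-H";
        } else if (cidSystemInfo.startsWith("Adobe-UCS-")) {
            // Assumed condition: the thread only says "the else-if-statement".
            return "Adobe-Identity-UCS";
        } else {
            // Assumed from the logs: drop the trailing supplement and append "-UCS2".
            int lastDash = cidSystemInfo.lastIndexOf('-');
            String prefix = lastDash >= 0 ? cidSystemInfo.substring(0, lastDash) : cidSystemInfo;
            return prefix + "-UCS2";
        }
    }

    public static void main(String[] args) {
        String[] samples = {"Adobe-Identity-0", "Adobe-UCS-0", "Adobe-WinCharSetFFFF-0"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + resolve(sample));
        }
    }
}
```

Under these assumptions, "Adobe-UCS-0" resolves to "Adobe-Identity-UCS", the name for which no cmap file exists. The alternative raised in the comment would be to return "Identity-H" from that branch instead.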

        MH added a comment -

        Same error message here:

        SEVERE Error: Could not parse predefined CMAP file for 'Adobe-Identity-UCS'

        when PDF has under or over content (e.g. watermark). Without such content, the error does not appear. (PDFBox 1.6.0)

        ECI added a comment -

        I reproduce the issue with PDF-Box 1.4 used by Apache Tika:
        In my log file, I have 2012-01-25 16:47:03 ERROR 127.0.0.1 [PDFont:285] - Error: Could not parse predefined CMAP file for 'Adobe-UCS-0' .

        This is on some PDF documents only.

        Herm added a comment -

        Error still there in 1.6.0. Issue still open and unresolved. When is a fix planned for this issue?

        Lars Torunski added a comment -

        My problem with Adobe-WinCharSetFFFF-UCS2 still exists in version 1.7.1:

        2012-08-02 10:19:25,771 [PWS-Index-Thread-41] ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'

        TtheB added a comment -

        I confirm that this is still not fixed in 1.7.1:
        2012-11-19 14:56:14,036 ERROR [scheduler_Worker-10] [pdfbox.pdmodel.font.PDCIDFont] determineEncoding Error: Could not parse predefined CMAP file for 'Adobe--UCS2'

        Alex Wajda added a comment -

        We use 1.6.0 and the issue is there:

        ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Adobe-WinCharSetFFFF-UCS2'

        Clemens Wyss added a comment -

        Using v1.6 (on tomcat@debian) and have got the problem
        ...
        2012-11-14 14:19:27,703 ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'ArialUnicodeMS-ArialUnicodeMS-UCS2'
        2012-11-14 14:19:46,406 ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Arial,Bold-Arial,Bold-UCS2'
        2012-11-14 14:20:02,958 ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'ArialMT-ArialMT-UCS2'
        2012-11-14 14:20:02,999 ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'ArialMT-ArialMT-UCS2'
        2012-11-14 14:20:05,072 ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Arial,Bold-Arial,Bold-UCS2'
        ...

        Wolfgang Kronberg added a comment -

        I still see this issue with 1.8.0 and 1.9.0-SNAPSHOT. In my case, the filename consists of binary rubbish, plus '-UCS2'.

        Looking at the code of PDCIDFont.determineEncoding(), it seems to me that the error message is misleading:

        cmap = parseCmap( resourceRootCMAP, ResourceLoader.loadResource( resourceName ));
        if( cmap == null)
        {
            log.error("Error: Could not parse predefined CMAP file for '" + cidSystemInfo + "'" );
        }

        Obviously, the message is so harsh because parseCmap() of a predefined file (included with pdfbox) must never fail; otherwise it would be a bug in pdfbox. Usually, however, the reason for this message is not a parsing failure but simply that there is no predefined file for the given resource name.

        In my opinion, such a case should not be treated more harshly than the case that getCIDSystemInfo() yields null in the first place. PDCIDFont.determineEncoding() handles this case by silently calling super.determineEncoding(), which usually completes without any errors. Thus, in my opinion, the code snippet above should be changed to:

        InputStream resIn = ResourceLoader.loadResource( resourceName );
        if (resIn != null)
        {
            cmap = parseCmap( resourceRootCMAP, resIn);
            if( cmap == null)
            {
                log.error("Error: Could not parse predefined CMAP file for '" + cidSystemInfo + "'" );
            }
        }
        else
        {
            super.determineEncoding();
        }

        Anyway, the binary rubbish I observe probably points to some other bug, and I have not been able to pin that one down. I have loads of PDF documents exhibiting this bug, all of them unfortunately confidential. If any team member is interested, please email me so that I can provide some examples.
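
The distinction proposed above, between a predefined file that is missing and one that is present but cannot be parsed, might look like the following self-contained sketch. It is not the PDFBox code: PredefinedCMapLookupSketch, Outcome, lookup and the stub parses are hypothetical names, the stub merely checks that the stream is non-empty, and the resource root is the path quoted earlier in this thread. Only the standard ClassLoader.getResourceAsStream call is used.

```java
// Sketch only: separates "no such predefined file" from "file present but unparseable".
// All names here are hypothetical; this is not the PDFBox implementation.
final class PredefinedCMapLookupSketch {
    enum Outcome { PARSED, MISSING_RESOURCE, PARSE_FAILED }

    // Resource root as quoted in this thread (assumed to be on the classpath).
    static final String CMAP_ROOT = "org/apache/pdfbox/resources/cmap/";

    private PredefinedCMapLookupSketch() {}

    /** Hypothetical stand-in for a real CMap parser: succeeds if the stream has at least one byte. */
    static boolean parses(java.io.InputStream in) throws java.io.IOException {
        return in.read() != -1;
    }

    static Outcome lookup(String cmapName) {
        ClassLoader loader = PredefinedCMapLookupSketch.class.getClassLoader();
        // try-with-resources tolerates a null resource; close() is simply skipped.
        try (java.io.InputStream in = loader.getResourceAsStream(CMAP_ROOT + cmapName)) {
            if (in == null) {
                // No bundled file with that name: not a parsing problem.
                return Outcome.MISSING_RESOURCE;
            }
            return parses(in) ? Outcome.PARSED : Outcome.PARSE_FAILED;
        } catch (java.io.IOException e) {
            return Outcome.PARSE_FAILED;
        }
    }

    public static void main(String[] args) {
        // The result depends on what is on the classpath when this runs.
        System.out.println(lookup("Adobe-WinCharSetFFFF-UCS2"));
    }
}
```

A caller could log an error only for PARSE_FAILED and quietly fall back to the default encoding for MISSING_RESOURCE, which is the behaviour the proposal asks for.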

        Tilman Hausherr added a comment -

        1)
        @Wolfgang: I had a look at the file I just uploaded with notepad++ (search for "Ordering"). It really has all the trash characters that appear in the error message. Maybe do the same with your files. (Btw I agree with your last comment after I had a look at the source)

        2)
        About the WinCharSetFFFF error: if my understanding is correct, what's missing is a file named "Adobe-WinCharSetFFFF-UCS2" in the pdfbox/resources/cmap directory. I found such a file here:
        http://bugs.ghostscript.com/attachment.cgi?id=4909
        linked from
        http://bugs.ghostscript.com/show_bug.cgi?id=690393
        My understanding of cmaps is very basic, so I don't know if that file is what's needed. How would I see that something gets "better"?

        Andreas Lehmkühler added a comment -

        1) The "trash" appears because the pdf is encrypted. You should try the PDFDebugger to examine pdfs, it's easy to use and will automatically decrypt encrypted docs

        2) After overhauling the font rendering stuff I assume that some of the encoding code has to be refactored as well. The error can be most likely ignored.

        Andreas Lehmkühler added a comment -

        I removed the misleading error message in revision 1554632 based on Wolfgang's proposal.

        I'll do some more research before closing this issue.

        Andreas Lehmkühler added a comment -

        I added the changes to the 1.8 branch in revision 1554645.

        As the main problem (false error message) is resolved I set this issue to resolved.

        Possible remaining issues with CMaps should be reported using a new issue if there isn't already one.

        Thanks to all helping us to resolve this one!


          People

          • Assignee:
            Andreas Lehmkühler
            Reporter:
            krishna
          • Votes:
            16
            Watchers:
            22

              Time Tracking

              Original Estimate: 48h
              Remaining Estimate: 48h
              Time Spent: Not Specified
