PDFBox / PDFBOX-4036

Invalid ToUnicode CMap in font

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.4, 2.0.8
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows 10 64 bit, STS 3.9.1, JDK 1.8.0_152, Gradle

    Description

      When calling textStripper.getText(document) on the attached PDF file to extract its text and save it to a .txt file, I receive the following warnings:

      Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
      WARNING: Invalid ToUnicode CMap in font UYQXWX+MaterialIcons-Regular
      Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+380 (380) in font UYQXWX+MaterialIcons-Regular
      Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+381 (381) in font UYQXWX+MaterialIcons-Regular
      Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
      WARNING: Invalid ToUnicode CMap in font FANHRS+MaterialIcons-Regular
      Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+380 (380) in font FANHRS+MaterialIcons-Regular
      Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
      WARNING: No Unicode mapping for CID+381 (381) in font FANHRS+MaterialIcons-Regular

      The file is ultimately generated and saved, but some letters are missing (like "ft" in "software" or "ff" in "different"). So far I've tested close to 10 files and this is the only problematic one I've found. Depending on which program I use to view the resulting .txt file, I get either blank spaces (Notepad) or "NUL" values (Notepad++) in place of the missing letters. What's more, some editors (Sublime Text) refuse to open the file outright, treating it as unreadable or corrupted binary data. Suffice it to say, working with such a file is somewhat difficult...
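
For anyone hitting the same symptom, here is a minimal post-processing sketch. It assumes the unmapped glyphs arrive in the extracted string as U+0000 characters, which is only inferred from the "NUL" display in Notepad++ and is not confirmed in this report. The class name NulScrubber and its methods are hypothetical helpers, not part of PDFBox. The sketch cannot recover the missing "ft" or "ff"; it only locates the gaps and swaps in a visible placeholder so editors no longer treat the file as binary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: post-processes text that has
// already been extracted (for example, the String returned by getText).
final class NulScrubber {
    private NulScrubber() {}

    // Returns the index of every U+0000 character in the text.
    // Assumption: unmapped glyphs show up as U+0000 in the output.
    static List<Integer> nulPositions(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\0') {
                positions.add(i);
            }
        }
        return positions;
    }

    // Replaces each U+0000 with a visible placeholder such as "\uFFFD".
    // This marks the gap; it does not restore the original letters.
    static String scrub(String text, String placeholder) {
        return text.replace("\0", placeholder);
    }
}
```

Calling NulScrubber.scrub(extractedText, "\uFFFD") before writing the .txt file would leave a replacement character wherever a glyph could not be mapped, and nulPositions can be used to log how many gaps a given document contains.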

      Attachments

        1. CSTA17.pdf
          1.63 MB
          Oleksii Zinkovskyi
        2. PDFBOX-4036-reduced.pdf
          73 kB
          Tilman Hausherr

          People

            Assignee: Unassigned
            Reporter: Oleksii Zinkovskyi
            Votes: 0
            Watchers: 2