  1. PDFBox
  2. PDFBOX-4250

PDF File with embedded fonts: text extraction fails or returns junk characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.0.9
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

       One of the people that I support created a PDF file from a LibreOffice document, and then misplaced the original document. I believed that I could use PDFBox to extract the text from the PDF and at least provide that information to the user.

      
When I ran the text extractor from the "app" jar on their PDF file, I got many messages of the following type:
      
      ...
      Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
      WARNING: No Unicode mapping for 7 (7) in font EXIRGE+Ubuntu
      Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
      WARNING: No Unicode mapping for 8 (8) in font EXIRGE+Ubuntu
      Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
      WARNING: No Unicode mapping for 1 (1) in font JTPICY+AndaleMono
      Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
      ...
      
The resulting "txt" file contains only binary values instead of readable text, except where the font is one of the "standard" fonts. I ran the debugger on the PDF file and saw that several fonts were embedded, and that these fonts used low numbers as character codes (1, 2, 3, etc.).
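
To illustrate why low character codes without a Unicode mapping come out as junk, here is a minimal, self-contained sketch. It is a toy model and not PDFBox code: the class `ToUnicodeSketch`, the method `decode`, and the sample table are hypothetical names made up for this example, and the fallback of emitting the raw code as a character is an assumption chosen to mirror the symptom described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a code-to-Unicode lookup for an embedded font (hypothetical, not PDFBox). */
public class ToUnicodeSketch {

    /**
     * Decodes character codes through a code-to-Unicode table.
     * Assumption for this sketch: a code with no table entry is emitted
     * as the raw code value, so codes 1, 2, 3 become control characters.
     */
    public static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            } else {
                // Same kind of condition that the warnings in the log report.
                System.err.println("WARNING: No Unicode mapping for " + code);
                out.append((char) code);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3, 3, 4};

        // A font that ships a mapping: extraction yields readable text.
        Map<Integer, String> table = new HashMap<>();
        table.put(1, "H");
        table.put(2, "e");
        table.put(3, "l");
        table.put(4, "o");
        System.out.println(decode(codes, table));

        // The same codes without any mapping: only control characters remain.
        String junk = decode(codes, new HashMap<>());
        System.out.println("unmapped result length: " + junk.length());
    }
}
```

In this model the page can still render correctly, because drawing a glyph only needs the code and the embedded outline, while text extraction and copy/paste need the code-to-Unicode step that is missing.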

      
When viewed, the PDF file looks good, but nothing can be copied or pasted from the display (again, text in a standard font seems OK).

      
The original file was of a sensitive nature, so I re-created the problem with a simpler file.

      
Running on Ubuntu 16.04

      LibreOffice was used to "print" to the cups-pdf "printer" (which may be part of the problem).

      
Text extraction was attempted with pdfbox-app-2.0.9.jar

      
PDF file is at:

      http://swansongrp.com/misc/mytest3.pdf

            People

            • Assignee:
              Unassigned
            • Reporter:
              wwi Bob Swanson
            • Votes:
              0
            • Watchers:
              2