PDFBox / PDFBOX-4210

Unable to extract the text from a PDF ("No Unicode mapping.." warnings)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.9
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

      I'm using Tika v1.18 (which uses PDFBox 2.0.9) to extract the text from a PDF.

      I have a document from which Adobe Acrobat Reader DC can extract the text (although not with 100% accuracy).

      Apart from warnings of the form "WARNING: No Unicode mapping for ... in font ArialMT", PDFBox 2.0.9 doesn't return any text.

      As the warning shows, the font in question is ArialMT. It uses a custom encoding, and the PDF doesn't include a ToUnicode mapping. The font type is CID TrueType (as reported by "pdffonts").
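
      To make the suspected failure mode concrete, here is a minimal sketch in plain Java. It is a simplified model and not PDFBox's actual code: the class `ToUnicodeSketch`, the method `decode`, the glyph codes, and the exact warning text are assumptions made for illustration only. It shows why an extractor that has no code-to-Unicode mapping for a font could emit one warning per glyph and end up with no text at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of text extraction; NOT PDFBox's real implementation.
final class ToUnicodeSketch {

    // Translates character codes to text via a ToUnicode-style lookup table.
    // Codes without an entry are skipped and recorded as warnings.
    static String decode(int[] codes, Map<Integer, String> toUnicode,
                         String fontName, List<String> warnings) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                // Message format is only modelled on the warning quoted above.
                warnings.add("No Unicode mapping for " + code + " in font " + fontName);
            } else {
                text.append(unicode);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {43, 72, 79, 79, 82}; // hypothetical glyph codes

        // Case 1: the font carries a mapping, so text comes out.
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(43, "H");
        mapping.put(72, "e");
        mapping.put(79, "l");
        mapping.put(82, "o");
        List<String> warnings = new ArrayList<>();
        System.out.println("with mapping: " + decode(codes, mapping, "ArialMT", warnings));

        // Case 2: custom encoding and no mapping, so every glyph is dropped.
        warnings.clear();
        String text = decode(codes, new HashMap<>(), "ArialMT", warnings);
        System.out.println("without mapping: \"" + text + "\", warnings: " + warnings.size());
    }
}
```

      In this model the second case returns an empty string together with one warning per code, which matches the behaviour I observe; whether PDFBox could fall back to another source of Unicode values for this font is exactly the open question.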

      "pdftotext" can't extract anything either; it only reports an error: `Syntax Error: Unknown character collection 'Adobe-ArialMT'`

      The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.

      I would like to determine whether this is a bug in PDFBox or whether the PDF producer has to be adjusted to improve the "readability" of the PDF.

        Attachments

        1. Testdokument.pdf
          76 kB
          Aleksandar Putnik
        2. Testdokument-AcrobatOCR.txt
          2 kB
          Maruan Sahyoun
        3. Testdokument-AcrobatOCR.pdf
          419 kB
          Maruan Sahyoun

          People

            • Assignee: Unassigned
            • Reporter: Aleksandar Putnik (aputnik)
            • Votes: 0
            • Watchers: 3