[PDFBOX-4210] Unable to extract the text from a PDF ("No Unicode mapping.." warnings) - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.9
Fix Version/s: None
Component/s: Text extraction
Labels: None

Description

I'm using Tika v1.18 (which means PDFBox 2.0.9) to extract the text from a PDF.

I have a document from which Adobe Acrobat Reader DC can extract the text (although not with 100% accuracy).

Apart from warnings of the form "WARNING: No Unicode mapping for ... in font ArialMT", PDFBox 2.0.9 doesn't return any text.

As you can see from the warning, the font in question is ArialMT. It uses a custom encoding, and the PDF doesn't include a ToUnicode mapping. The font type is CID TrueType (as reported by "pdffonts").
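To illustrate why a missing ToUnicode mapping can leave an extractor with nothing to return, here is a minimal, library-free sketch. It is not PDFBox's actual implementation: the class `ToUnicodeSketch`, the helper `decode`, and the character codes are all hypothetical, and the sketch assumes the extractor has no other fallback (such as a standard encoding or a known character collection) for the font.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {

    // Hypothetical helper: resolve each character code through the given
    // code-to-Unicode table. Codes without an entry are reported and skipped,
    // which models an extractor that has no fallback for the font.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                System.err.println("No mapping for code " + code);
                continue;
            }
            text.append(unicode);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up codes, standing in for the codes found in a content stream.
        int[] codes = {43, 76};

        // With a table (the role a ToUnicode CMap plays), the codes resolve.
        Map<Integer, String> table = new HashMap<>();
        table.put(43, "H");
        table.put(76, "i");
        System.out.println(decode(codes, table));

        // Without a table, every code is skipped and the result is empty.
        String none = decode(codes, Collections.<Integer, String>emptyMap());
        System.out.println(none.isEmpty());
    }
}
```

In this model the glyphs can still be drawn, because rendering only needs the code-to-glyph path; only the code-to-text path is missing.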

"pdftotext" can't extract anything either; it only shows the error `Syntax Error: Unknown character collection 'Adobe-ArialMT'`.
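The name in that error has the shape of a CID character collection, which is conventionally written as Registry-Ordering (for example Adobe-Japan1). The sketch below shows that kind of lookup under stated assumptions: the class and method names are hypothetical, the set of known collections is a non-exhaustive illustration, and this is not how pdftotext is actually implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CharacterCollectionSketch {

    // Illustrative, non-exhaustive set of well-known Adobe character collections.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "Adobe-GB1", "Adobe-CNS1", "Adobe-Japan1", "Adobe-Korea1", "Adobe-Identity"));

    // A character collection is conventionally named Registry-Ordering.
    static String collectionName(String registry, String ordering) {
        return registry + "-" + ordering;
    }

    // Hypothetical check: can this collection be mapped to Unicode by name alone?
    static boolean isKnown(String registry, String ordering) {
        return KNOWN.contains(collectionName(registry, ordering));
    }

    public static void main(String[] args) {
        System.out.println(isKnown("Adobe", "Japan1"));
        // The collection named in the error is not one of the well-known ones.
        System.out.println(isKnown("Adobe", "ArialMT"));
    }
}
```

A tool that relies on the collection name to reach Unicode has nothing to go on when the ordering is a font name such as ArialMT, which is consistent with the error quoted above.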

The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.

I would like to determine whether this is a bug in PDFBox or whether the PDF producer has to be adjusted to improve the "readability" of the PDF.

Attachments

Testdokument-AcrobatOCR.pdf, 08/May/18 13:16, 419 kB, Maruan Sahyoun
Testdokument-AcrobatOCR.txt, 08/May/18 13:16, 2 kB, Maruan Sahyoun
Testdokument.pdf, 08/May/18 07:21, 76 kB, Aleksandar Putnik

People

Assignee: Unassigned

Reporter: Aleksandar Putnik

Votes: 0

Watchers: 3

Dates

Created: 08/May/18 07:20

Updated: 08/May/18 16:25