[PDFBOX-2532] Text extraction fails due to the usage of the internal font mapping - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.0
Fix Version/s: 4.0.0
Component/s: Text extraction
Labels: None

Description

If a PDF doesn't provide any mapping (neither an encoding nor a ToUnicode mapping), we have to decide for ourselves where to get a suitable mapping. We can't use the internal font mapping of the Type 1C font, as it doesn't work in every case; see PDFBOX-2377, which provides a solution for the 1.8 branch.
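To make the problem concrete, the lookup order described above can be pictured as a chain of mapping sources that are tried in turn. The following is only a minimal sketch under stated assumptions: the class `UnicodeFallbackSketch`, its methods `resolve` and `fromMap`, and the sample maps are all hypothetical and are not part of the PDFBox API, and the order in which sources should be trusted is precisely what this issue leaves open.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.IntFunction;

// Hypothetical illustration only; none of these names exist in PDFBox.
public class UnicodeFallbackSketch {

    /** Tries each mapping source in order and returns the first hit, if any. */
    static Optional<String> resolve(int code, List<IntFunction<Optional<String>>> sources) {
        for (IntFunction<Optional<String>> source : sources) {
            Optional<String> result = source.apply(code);
            if (result.isPresent()) {
                return result;
            }
        }
        // No source knows this code: report "unmapped" instead of guessing.
        return Optional.empty();
    }

    /** Wraps a plain code-to-text map as a mapping source. */
    static IntFunction<Optional<String>> fromMap(Map<Integer, String> map) {
        return code -> Optional.ofNullable(map.get(code));
    }

    public static void main(String[] args) {
        // Assumed sample data: the PDF supplies neither mapping, as in this issue.
        Map<Integer, String> toUnicode = Map.of();
        Map<Integer, String> encoding = Map.of();
        // Assumed stand-in for whatever fallback source ends up being chosen.
        Map<Integer, String> fallback = Map.of(65, "A");

        List<IntFunction<Optional<String>>> sources =
                List.of(fromMap(toUnicode), fromMap(encoding), fromMap(fallback));

        System.out.println(resolve(65, sources).orElse("<unmapped>"));
        System.out.println(resolve(66, sources).orElse("<unmapped>"));
    }
}
```

Returning an empty `Optional` when every source fails keeps an unmapped code distinguishable from a wrongly mapped one, which matters when a fallback source such as the font's internal mapping is not reliable in every case.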

Attachments

- PDFBOX2247-Debugger.png, 03/Dec/14 07:50, 137 kB, Andreas Lehmkühler
- PDFBOX2247-701542.pdf, 29/Nov/14 18:45, 242 kB, Andreas Lehmkühler
- PDFBOX2247-701542_sa_reader_osx.txt, 02/Dec/14 08:33, 7 kB, Maruan Sahyoun
- PDFBOX2247-701542_sa_acrobat.txt, 02/Dec/14 07:59, 7 kB, Andreas Lehmkühler
- PDFBOX2247-701542_sa_acrobat_osx.txt, 02/Dec/14 08:33, 7 kB, Maruan Sahyoun
- PDFBOX2247-701542_cp_acrobat.txt, 02/Dec/14 07:59, 7 kB, Andreas Lehmkühler

Issue Links

is related to: PDFBOX-3066 Text extraction garbled in this file, was OK in 1.8 (Open)

People

Assignee: Unassigned

Reporter: Andreas Lehmkühler

Votes: 0

Watchers: 2

Dates

Created: 29/Nov/14 18:45

Updated: 28/Dec/20 14:11