[PDFBOX-4749] Text Extraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS' - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.18
Fix Version/s: 3.0.0 PDFBox
Component/s: Text extraction
Labels: None

Description

Consider the attached PDF, specifically this text on the first page:

"Am Fährweg"

It appears that the code for the first character, 'A', is 65 and is extracted correctly, while the code for the fourth character, 'F', is 70 and is extracted as 'c'.

org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a CMap named 'AdHoc-UCS', whose mapping is:

{129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2, 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D, 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S, 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g, 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v, 90=w, 93=z, 95=|, 108=ä, 124=ö}

-> 'A' (code 65) is extracted as 'A' because 65 has no entry in the CMap, while 'F' (code 70) collides with the entry that maps 70 to 'c'.
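The behaviour described above can be modelled with a small, self-contained sketch. This is not PDFBox's actual implementation: the class `ToUnicodeFallbackSketch` and its `decode` method are hypothetical names, and the fallback (treating an unmapped code as a plain character code) is an assumption inferred from the observation that code 65 comes out as 'A'. Only a few entries of the quoted mapping are copied in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the lookup described in this report; not PDFBox code.
final class ToUnicodeFallbackSketch {
    // A handful of entries copied from the 'AdHoc-UCS' mapping quoted above.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(36, "A");
        TO_UNICODE.put(41, "F");
        TO_UNICODE.put(68, "a");
        TO_UNICODE.put(70, "c");
        TO_UNICODE.put(80, "m");
    }

    // Use the CMap entry when one exists; otherwise (assumption) fall back
    // to interpreting the code directly as a character code.
    static String decode(int code) {
        String mapped = TO_UNICODE.get(code);
        return mapped != null ? mapped : String.valueOf((char) code);
    }
}
```

With this model, `decode(65)` yields "A" because 65 is absent from the map, while `decode(70)` yields "c" because the map entry takes precedence over the character code for 'F'.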

The document is correctly parsed in Acrobat Reader.

Attachments

screenshot-1.png, 26/Apr/20 14:03, 5 kB, Tilman Hausherr
PDFBOX-4749-reduced.pdf, 22/Jan/20 18:49, 6 kB, Tilman Hausherr

Issue Links

links to

Stackoverflow - PDFBox 2.0.7 ExtractText not working but 1.8.13 does and PDFReader as well

People

Assignee: Andreas Lehmkühler

Reporter: Benoit Lacelle

Votes: 0

Watchers: 4

Dates

Created: 21/Jan/20 09:50

Updated: 18/Aug/23 05:46

Resolved: 03/Nov/20 16:24