Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
2.0.18
None
Description
Consider the attached PDF, in particular the text on the first page:
"Am Fährweg"
It appears that the code for the first character, 'A', is 65 and is parsed correctly, while the code for the fourth character, 'F', is 70, which is parsed as 'c'.
org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a CMap named 'AdHoc-UCS', whose mapping is:
{129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2, 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D, 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S, 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g, 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v, 90=w, 93=z, 95=|, 108=ä, 124=ö}
'A' (code 65) is parsed as 'A' because 65 has no entry in the CMap, while 'F' (code 70) collides with the entry that maps 70 to 'c'.
The document is correctly parsed in Acrobat Reader.
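The behaviour described above can be modelled with a small standalone sketch. It does not use PDFBox and is not its actual implementation. It assumes a lookup that consults the CMap first and otherwise treats the code as a plain character code. The class name `CMapLookupSketch` and both methods are hypothetical, and only a handful of entries from the 'AdHoc-UCS' mapping are copied in.

```java
import java.util.HashMap;
import java.util.Map;

public class CMapLookupSketch {

    // A few entries copied from the 'AdHoc-UCS' mapping quoted in the report.
    static Map<Integer, String> adHocUcs() {
        Map<Integer, String> m = new HashMap<>();
        m.put(36, "A");
        m.put(41, "F");
        m.put(68, "a");
        m.put(70, "c");
        m.put(80, "m");
        return m;
    }

    // Hypothetical lookup (an assumption, not PDFBox code): use the CMap entry
    // if one exists, otherwise interpret the code as a character code.
    static String toUnicode(Map<Integer, String> cmap, int code) {
        String mapped = cmap.get(code);
        return mapped != null ? mapped : String.valueOf((char) code);
    }

    public static void main(String[] args) {
        Map<Integer, String> cmap = adHocUcs();
        // 65 has no CMap entry, so the fallback yields 'A'.
        System.out.println("65 -> " + toUnicode(cmap, 65));
        // 70 has a CMap entry (70=c), so the intended 'F' is never produced.
        System.out.println("70 -> " + toUnicode(cmap, 70));
    }
}
```

Under this model, code 65 only comes out right because it happens to be missing from the CMap. Any code that has an entry, such as 70, is decoded through the CMap and gives the wrong character.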