PDFBox / PDFBOX-4749

Text Extraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.18
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels:
      None

      Description

      Consider the attached PDF, and in particular this text on the first page:

      "Am Fährweg"

      It appears that the code for the first character 'A' is 65, which is extracted correctly, while the code for the fourth character 'F' is 70, which is extracted as 'c'.

      org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a CMap named 'AdHoc-UCS', which contains the following mapping:

      {129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2, 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D, 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S, 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g, 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v, 90=w, 93=z, 95=|, 108=ä, 124=ö}

      -> 'A' (code 65) is extracted as 'A' because 65 has no entry in the CMap, while 'F' (code 70) collides with the entry that maps 70 to 'c'.
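
      The behaviour described above can be imitated without PDFBox. The sketch below is only a stand-in: the class name, the `decode` helper and the "fall back to the code itself when the CMap has no entry" rule are assumptions drawn from this report, not PDFBox's actual implementation. Only a few entries of the mapping are copied.

```java
import java.util.HashMap;
import java.util.Map;

class AdHocLookupSketch {
    // Excerpt of the 'AdHoc-UCS' mapping quoted in this report.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(3, " ");
        TO_UNICODE.put(36, "A");
        TO_UNICODE.put(41, "F");
        TO_UNICODE.put(70, "c");
        TO_UNICODE.put(80, "m");
    }

    // Hypothetical lookup: prefer the CMap entry; if the code has no entry,
    // treat the code itself as the character (assumed fallback).
    static String decode(int code) {
        String mapped = TO_UNICODE.get(code);
        return mapped != null ? mapped : String.valueOf((char) code);
    }

    public static void main(String[] args) {
        System.out.println(decode(65)); // no entry for 65, fallback gives "A"
        System.out.println(decode(70)); // entry 70=c wins, so 'F' becomes "c"
    }
}
```

      Under these assumptions the extraction result depends on whether a code happens to collide with a key in the CMap, which matches what is seen for "Am Fährweg".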

      The document is correctly parsed in Acrobat Reader.
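
      One observation on the posted mapping that may help triage: every ASCII entry has a key exactly 29 below its Unicode code point (3 for space, 36 for 'A', 41 for 'F', 68 for 'a'), so the keys do not look like the codes 65 and 70 seen in the page content. The sketch below checks that offset on a sample of the entries. The class and method names are hypothetical, and this is an observation about the data only, not a diagnosis of the bug.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AdHocOffsetSketch {
    // A sample of the ASCII entries from the reported mapping (key -> character).
    static Map<Integer, Character> sample() {
        Map<Integer, Character> m = new LinkedHashMap<>();
        m.put(3, ' ');
        m.put(19, '0');
        m.put(36, 'A');
        m.put(41, 'F');
        m.put(61, 'Z');
        m.put(68, 'a');
        m.put(70, 'c');
        m.put(93, 'z');
        m.put(95, '|');
        return m;
    }

    // True if every entry satisfies: code point of value - key == offset.
    static boolean hasConstantOffset(Map<Integer, Character> entries, int offset) {
        for (Map.Entry<Integer, Character> e : entries.entrySet()) {
            if (e.getValue() - e.getKey() != offset) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasConstantOffset(sample(), 29)); // prints true
    }
}
```

      The non-ASCII entries (108=ä, 124=ö, 129=ü) do not follow this offset, so they are left out of the sample.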

        Attachments

        1. PDFBOX-4749-reduced.pdf
          6 kB
          Tilman Hausherr

            People

            • Assignee: Andreas Lehmkühler (lehmi)
            • Reporter: Benoit Lacelle (blasd)