PDFBox / PDFBOX-4749

Text Extraction leads to weird result - toUnicodeCMap is 'AdHoc-UCS'

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.18
    • Fix Version/s: 3.0.0 PDFBox
    • Component/s: Text extraction
    • Labels: None

    Description

      Consider the attached PDF, in particular this text on the first page:

      "Am Fährweg"

      The code for the first character, 'A', is 65 and is decoded correctly, while the code for the fourth character, 'F', is 70 and is decoded as 'c'.

      org.apache.pdfbox.pdmodel.font.PDFont.toUnicode(int) relies on a ToUnicode CMap named 'AdHoc-UCS', which contains the following mapping:

      {129=ü, 3= , 8=%, 9=&, 11=(, 12=), 15=,, 16=-, 17=., 18=/, 19=0, 20=1, 21=2, 22=3, 23=4, 24=5, 25=6, 26=7, 27=8, 28=9, 29=:, 34=?, 36=A, 37=B, 38=C, 39=D, 40=E, 41=F, 42=G, 43=H, 44=I, 46=K, 47=L, 48=M, 49=N, 50=O, 51=P, 53=R, 54=S, 55=T, 56=U, 57=V, 58=W, 59=X, 61=Z, 68=a, 69=b, 70=c, 71=d, 72=e, 73=f, 74=g, 75=h, 76=i, 78=k, 79=l, 80=m, 81=n, 82=o, 83=p, 85=r, 86=s, 87=t, 88=u, 89=v, 90=w, 93=z, 95=|, 108=ä, 124=ö}

      As a result, 'A' (code 65) is decoded as 'A' because 65 has no entry in the CMap, while 'F' (code 70) collides with the entry that maps 70 to 'c'.

      The document is correctly parsed in Acrobat Reader.
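
      The lookup behaviour described above can be modelled with a small, self-contained sketch. It is not PDFBox's actual implementation: the class AdHocCMapDemo and the method decode are hypothetical, the map holds only a few of the entries quoted in the report, and the fallback (treating an unmapped code as the character itself) is an assumption standing in for whatever the font's encoding would yield.

```java
import java.util.HashMap;
import java.util.Map;

class AdHocCMapDemo {
    // A few entries copied from the 'AdHoc-UCS' mapping quoted in the report.
    static final Map<Integer, String> CMAP = new HashMap<>();
    static {
        CMAP.put(3, " ");
        CMAP.put(36, "A");
        CMAP.put(41, "F");
        CMAP.put(68, "a");
        CMAP.put(70, "c");
        CMAP.put(80, "m");
    }

    // Hypothetical simplified lookup: prefer the CMap entry when one exists,
    // otherwise fall back to interpreting the code as the character itself
    // (an assumption made for illustration only).
    static String decode(int code) {
        String mapped = CMAP.get(code);
        return mapped != null ? mapped : String.valueOf((char) code);
    }

    public static void main(String[] args) {
        // 65 has no CMap entry, so the fallback yields the expected letter.
        System.out.println("65 -> " + decode(65));
        // 70 has a CMap entry, so the CMap wins even though 'F' was intended.
        System.out.println("70 -> " + decode(70));
    }
}
```

      In this model the wrong result for 'F' follows from the CMap taking precedence whenever it has an entry for the code, regardless of whether that entry matches the character the document intends.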

      Attachments

        1. PDFBOX-4749-reduced.pdf
          6 kB
          Tilman Hausherr
        2. screenshot-1.png
          5 kB
          Tilman Hausherr

          People

            Assignee: lehmi Andreas Lehmkühler
            Reporter: blasd Benoit Lacelle
            Votes: 0
            Watchers: 4
