  1. PDFBox
  2. PDFBOX-4720

cmap entries "<0000> <FFFF> <0000>" are cut

    Details

      Description

      Part of the extracted text is lost in the attached files because of the ToUnicode stream. Entries like

      <0000> <FFFF> <0000> 

      are cut off after 256 elements, likely as a result of PDFBOX-4661 and earlier changes. While such an entry is incorrect, I think it should still be accepted when it is exactly that one.

      Several such files have popped up in the last regression tests; I analysed only this one, but it explains why I saw so many "foreign" differences: the ASCII codes are OK, but the "very special" characters are not.
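
      To illustrate the effect described above, here is a minimal, self-contained sketch. It is not PDFBox's parser: the class BfRangeSketch, the method expand, the constant MAX_ENTRIES and the flag acceptFullIdentity are hypothetical names made up for this example, and the 256-element cap is simply the limit mentioned in this report. It expands a range entry into a code-to-Unicode map, once with the cap applied unconditionally and once with the exact entry <0000> <FFFF> <0000> accepted in full.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of range expansion; not the actual PDFBox code. */
class BfRangeSketch {
    // Assumed per-range cap, taken from the "cut at 256 elements" observation.
    static final int MAX_ENTRIES = 256;

    /**
     * Expands "<start> <end> <dst>" into a map from character code to text.
     * If acceptFullIdentity is true, the exact entry <0000> <FFFF> <0000>
     * is expanded completely instead of being capped.
     */
    static Map<Integer, String> expand(int start, int end, int dst,
                                       boolean acceptFullIdentity) {
        boolean fullIdentity = start == 0x0000 && end == 0xFFFF && dst == 0x0000;
        int count = end - start + 1;
        if (!(acceptFullIdentity && fullIdentity)) {
            count = Math.min(count, MAX_ENTRIES);
        }
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < count; i++) {
            // Each code maps to the destination value incremented by its offset.
            map.put(start + i, String.valueOf((char) (dst + i)));
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, String> cut = expand(0x0000, 0xFFFF, 0x0000, false);
        Map<Integer, String> full = expand(0x0000, 0xFFFF, 0x0000, true);

        System.out.println(cut.size());               // 256
        System.out.println(full.size());              // 65536
        System.out.println(cut.get(0x0041));          // A (codes below 0x100 survive)
        System.out.println(cut.containsKey(0x20AC));  // false (euro sign mapping is lost)
        System.out.println(full.containsKey(0x20AC)); // true
    }
}
```

      With the cap applied, only codes 0x0000 to 0x00FF keep a mapping, which matches the observation that ASCII text is extracted correctly while characters with higher codes are lost.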

        Attachments

        1. OLZD4CBMION7ZUV3M62BYWWIQJIS3IA3.pdf
          422 kB
          Tilman Hausherr
        2. OLZD4CBMION7ZUV3M62BYWWIQJIS3IA3-reduced.pdf
          180 kB
          Tilman Hausherr

            People

            • Assignee:
              lehmi Andreas Lehmkühler
              Reporter:
              tilman Tilman Hausherr
            • Votes:
              0
              Watchers:
              3