  1. PDFBox
  2. PDFBOX-4720

cmap entries "<0000> <FFFF> <0000>" are cut

Details

    Description

      Part of the extracted text is lost in the attached files because of the ToUnicode stream. Entries like

      <0000> <FFFF> <0000> 

      are cut at 256 elements, likely as a result of PDFBOX-4661 and earlier changes. While such an entry is incorrect, I think it should still be accepted when it is exactly this one.
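
      The proposed behaviour can be sketched as follows. This is only an illustration, not PDFBox's actual CMap parser: the class BfRangeSketch, the method expand and the constant MAX_RANGE are hypothetical names, and the 256-entry cap is assumed from the cut described above. Ordinary ranges stay capped, while the single entry <0000> <FFFF> <0000> is expanded in full.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed rule; not PDFBox's real parser.
final class BfRangeSketch {

    // Assumed cap on ordinary ranges, matching the 256-element cut in the report.
    static final int MAX_RANGE = 256;

    // Expands one bfrange entry "<start> <end> <dst>" into code -> text mappings.
    static Map<Integer, String> expand(int start, int end, int dst) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        if (end < start) {
            return mapping; // malformed range, nothing to map
        }
        int count = end - start + 1;
        // The one tolerated exception: the full two-byte identity range.
        boolean fullIdentity = start == 0x0000 && end == 0xFFFF && dst == 0x0000;
        if (!fullIdentity) {
            count = Math.min(count, MAX_RANGE);
        }
        for (int i = 0; i < count; i++) {
            // Treats the destination as a single code point incremented per entry.
            mapping.put(start + i, new String(Character.toChars(dst + i)));
        }
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println("identity entries: " + expand(0x0000, 0xFFFF, 0x0000).size());
        System.out.println("ordinary entries: " + expand(0x0100, 0x03FF, 0x0041).size());
    }
}
```

      With this rule, a code such as 0x00E9 in the identity range still maps to "é", whereas any other oversized range keeps only its first 256 entries.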

      Several such files have popped up in the last regression tests; I analysed only this one, but it explains why I saw so many "foreign" differences: the ASCII codes are OK, but not the "very special" characters.

      Attachments

        1. OLZD4CBMION7ZUV3M62BYWWIQJIS3IA3.pdf
          422 kB
          Tilman Hausherr
        2. OLZD4CBMION7ZUV3M62BYWWIQJIS3IA3-reduced.pdf
          180 kB
          Tilman Hausherr

          People

            lehmi Andreas Lehmkühler
            tilman Tilman Hausherr
