Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
2.0.17
Description
Parts of the extracted text are lost in the attached files because of the ToUnicode stream. Entries like
<0000> <FFFF> <0000>
are cut off at 256 elements, likely as a result of PDFBOX-4661 and earlier changes. While such an entry is incorrect, I think it should still be accepted when it is exactly that one range.
Several such files have popped up in the last regression tests. I analysed only this one, but it explains why I saw so many "foreign" differences: the ASCII codes are OK, but the "very special" characters are not.
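
To illustrate the proposal, here is a minimal sketch of how a range expander could keep the 256-element cut for ordinary entries while accepting exactly <0000> <FFFF> <0000> as a full identity mapping. This is not the actual PDFBox parser code: the class ToUnicodeRangeSketch, the method expandRange and the constant MAX_RANGE are hypothetical names made up for this example, and a plain HashMap stands in for whatever structure the real CMap uses.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeRangeSketch {
    // Hypothetical cap standing in for the "cut at 256 elements" behaviour.
    static final int MAX_RANGE = 256;

    // Expands one range entry <start> <end> <dest> into code -> text mappings.
    static Map<Integer, String> expandRange(int start, int end, int dest) {
        Map<Integer, String> map = new HashMap<>();
        if (end < start) {
            return map; // nothing to map for an inverted range
        }
        // Special case: exactly <0000> <FFFF> <0000> is accepted in full,
        // although it is longer than a well-formed range may be.
        boolean fullIdentity = start == 0x0000 && end == 0xFFFF && dest == 0x0000;
        int last = fullIdentity ? end : Math.min(end, start + MAX_RANGE - 1);
        for (int code = start; code <= last; code++) {
            int codePoint = dest + (code - start);
            if (!Character.isValidCodePoint(codePoint)) {
                break; // ran past the Unicode range, stop here
            }
            map.put(code, new String(Character.toChars(codePoint)));
        }
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, String> identity = expandRange(0x0000, 0xFFFF, 0x0000);
        System.out.println("identity entries: " + identity.size());
        System.out.println("U+20AC mapped: " + identity.containsKey(0x20AC));
        System.out.println("other range entries: "
                + expandRange(0x0100, 0xFFFF, 0x0100).size());
    }
}
```

With this approach, any other overlong range is still truncated as before, so only the one malformed entry seen in the attached files changes behaviour.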