Part of the extracted text is lost in the attached files because of the ToUnicode stream. Entries like
are cut at 256 elements, likely as a result of
PDFBOX-4661 and earlier issues. While such an entry is incorrect, I think it should still be accepted when it is exactly that one.
Several such files have turned up in the latest regression tests. I analysed only this one, but it explains why I saw so many "foreign" differences: the ASCII codes are OK, but the "very special" characters are not.
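
To illustrate the effect, here is a minimal sketch. It is not PDFBox's actual parser code: the class name BfRangeSketch, the method expand, and the over-long identity range <0000> <03FF> <0000> are all made up for the example. It expands one bfrange entry twice, once cut at 256 elements and once in full, and looks up an ASCII code and a non-ASCII code in both results.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: NOT PDFBox's CMapParser; the range used below is hypothetical.
class BfRangeSketch {

    // Expands a bfrange <start> <end> <dstStart> into code -> Unicode mappings,
    // stopping once maxEntries codes have been mapped
    // (pass Integer.MAX_VALUE to expand the whole range).
    static Map<Integer, String> expand(int start, int end, int dstStart, int maxEntries) {
        Map<Integer, String> mapping = new LinkedHashMap<>();
        for (int code = start; code <= end && mapping.size() < maxEntries; code++) {
            int unicode = dstStart + (code - start);
            mapping.put(code, new String(Character.toChars(unicode)));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Hypothetical over-long identity range: <0000> <03FF> <0000>
        Map<Integer, String> cut = expand(0x0000, 0x03FF, 0x0000, 256);
        Map<Integer, String> full = expand(0x0000, 0x03FF, 0x0000, Integer.MAX_VALUE);

        // An ASCII code survives the cut; a code beyond the first 256 does not.
        System.out.println("0x0041 when cut:      " + cut.get(0x0041));
        System.out.println("0x0141 when cut:      " + cut.get(0x0141));
        System.out.println("0x0141 when accepted: " + full.get(0x0141));
    }
}
```

With the cut applied, the code 0x0041 still resolves but 0x0141 has no mapping at all, which matches the pattern above: ASCII text extracts fine while characters further up the range are lost.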