The next PDFBox version identifies soft hyphens (U+00AD) and returns them as such. Tika-eval swallows them and therefore reports differences. This can be seen with the file attached to
PDFBOX-5115, in the text "Max-Planck-Institut", and in line 4 of the attached Excel file.
"\u00AD" => "-"
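
A minimal sketch of the mapping above, reading it as a normalization applied to the extracted text before comparison. The class name `SoftHyphenNormalizer` and the method `normalize` are hypothetical and not part of tika-eval; the sample string is only illustrative and is not taken from the attached file.

```java
public class SoftHyphenNormalizer {

    // U+00AD SOFT HYPHEN, as returned by the newer PDFBox text extraction.
    private static final char SOFT_HYPHEN = '\u00AD';

    /**
     * Hypothetical helper: replaces every soft hyphen with an ASCII
     * hyphen-minus so that both extracts compare equal.
     */
    public static String normalize(String text) {
        if (text == null) {
            return null;
        }
        return text.replace(SOFT_HYPHEN, '-');
    }

    public static void main(String[] args) {
        // Illustrative input only: soft hyphens where the other extract has "-".
        String extracted = "Max\u00ADPlanck\u00ADInstitut";
        System.out.println(normalize(extracted)); // prints "Max-Planck-Institut"
    }
}
```

Whether the soft hyphen should be mapped to "-" or dropped entirely depends on what the other extract contains; the sketch follows the mapping given above.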