Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Fixed
- 1.25
- None
Description
The next PDFBox version identifies soft hyphens (U+00AD) and returns them as such. Tika-eval swallows them and therefore reports differences. This can be seen in "Max-Planck-Institut" in the file attached to PDFBOX-5115, and in line 4 of the attached Excel file.
Proposed change: add
"\u00AD" => "-"
to lucene-char-mapping.txt
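To illustrate the intent of the mapping, here is a minimal plain-Java sketch. It is not tika-eval's actual code: the class name, method names, and sample string are hypothetical, and the real change would presumably be applied by the Lucene character mapping that reads lucene-char-mapping.txt. The sketch only contrasts dropping U+00AD with mapping it to "-".

```java
// Hypothetical stand-in for the effect of the proposed mapping; not tika-eval code.
final class SoftHyphenMappingSketch {

    // U+00AD SOFT HYPHEN, which newer PDFBox output may contain.
    static final char SOFT_HYPHEN = '\u00AD';

    // Models the proposed rule: every soft hyphen becomes a plain hyphen-minus.
    static String applyMapping(String text) {
        return text.replace(SOFT_HYPHEN, '-');
    }

    // Models the behavior described in the issue: soft hyphens are swallowed.
    static String dropSoftHyphens(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "");
    }

    public static void main(String[] args) {
        // Assumed sample: the hyphens of "Max-Planck-Institut" extracted as U+00AD.
        String extracted = "Max" + SOFT_HYPHEN + "Planck" + SOFT_HYPHEN + "Institut";

        System.out.println(dropSoftHyphens(extracted)); // prints "MaxPlanckInstitut"
        System.out.println(applyMapping(extracted));    // prints "Max-Planck-Institut"
    }
}
```

With the mapping in place, an extract containing soft hyphens and an extract containing ordinary hyphens would normalize to the same text, so they should no longer be reported as different.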
Attachments
Issue Links
- relates to PDFBOX-5115 U+00AD ('sfthyphen') is not available in this font Times-Roman encoding: WinAnsiEncoding (Closed)