Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Description
Text extracted from some PDF files contains strings with repeated (doubled) characters.
To reproduce the problem, download the attached PDF file and run the following command:
java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
The bad strings all appear to be headings, so perhaps something in the font or another style feature is causing the problem.
First detected in version 1.19 and retested with 1.25; earlier versions were not tested.
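For readers unfamiliar with the filter in the reproduction command, here is a minimal sketch of what the pattern `(.)\1(.)\2` selects: any line containing two adjacent doubled characters. The helper name `find_doubled` and the sample strings are made up for illustration and are not taken from the attached PDF or from Tika.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika): reads text on stdin and prints
# only the lines that contain two adjacent doubled characters, e.g. "HHEE".
find_doubled() {
  grep -E '(.)\1(.)\2'
}

# Made-up sample input: the first line mimics a doubled heading,
# the second is ordinary text, the third is an ordinary word that
# happens to contain two adjacent doubled letters ("ffee").
printf '%s\n' 'IINNSSTTAALLLLAATTIIOONN' 'Normal body text' 'coffee' | find_doubled
```

Because `.` matches any character, the filter also flags ordinary words such as "coffee" and runs of four or more spaces, so its output is probably best read as a list of candidates to inspect rather than a definitive list of bad strings.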