Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.16, 2.0.27, 3.0.0 PDFBox
Environment: Same on Windows, Linux and macOS
Description
I am using PDFBox as part of Tika, and for some PDFs the extracted text is unreadable. Copying text from the same PDFs in Adobe, macOS Preview, or browsers works as expected.
I also tried "re-encoding" the PDF by editing and saving it with Acrobat, in case the original PDF creator was at fault, and I tried running PDFBox with different encodings, but the output remained mostly unchanged.
I attached the PDF and the text it produces. I ran PDFBox via the CLI as follows:
root % java -jar pdfbox-app-3.0.0-alpha3.jar export:text -i test.pdf
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Invalid ToUnicode CMap in font
Nov 06, 2022 9:12:47 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
WARNUNG: Using predefined identity CMap instead
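The warnings above suggest one possible reason for the unreadable output. The following is a minimal, stdlib-only sketch, not PDFBox code. It assumes the identity fallback treats each character code as the Unicode code point of the same value. The class name, the `TO_UNICODE` table and the sample codes are all hypothetical, made up for illustration. Under that assumption, a subset font whose glyph codes are small integers decodes to control characters rather than readable text:

```java
import java.util.Map;

public class IdentityFallbackSketch {
    // Hypothetical ToUnicode table for a subset font: glyph code -> text.
    static final Map<Integer, String> TO_UNICODE = Map.of(1, "P", 2, "D", 3, "F");

    // Decode using the (hypothetical) ToUnicode table; unknown codes become U+FFFD.
    static String decodeWithTable(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(TO_UNICODE.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // Assumed identity fallback: each code is used directly as a Unicode code point.
    static String decodeIdentity(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    // Fraction of characters that are ISO control characters; a rough "unreadable" signal.
    static double controlCharRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long controls = text.chars().filter(Character::isISOControl).count();
        return (double) controls / text.length();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3};
        System.out.println(decodeWithTable(codes));                   // prints "PDF"
        System.out.println(controlCharRatio(decodeIdentity(codes)));  // prints "1.0"
    }
}
```

A check like `controlCharRatio` could be run on the extracted text to flag affected documents, although the threshold at which text counts as unreadable is a judgment call and would need tuning against real files.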