Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
Description
The PDF document at http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf produces invalid characters in the output when filtered by Tika 1.4.
> /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | hea… …d -n 40
ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
[snip]
<p>Cycle network </p>
<p> </p>
<p>HILEY </p>
Is there any proper way to avoid this, or is the best approach to strip such characters from Tika's output?
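For reference, the stripping workaround could look roughly like the sketch below. It assumes the offending characters are code points that XML 1.0 does not allow (control characters, unpaired surrogates), which this report does not confirm. The class and method names (TikaOutputCleaner, stripInvalidXmlChars) are hypothetical and are not part of Tika.

```java
// Hypothetical helper, not part of Tika: post-processes text already extracted by Tika.
class TikaOutputCleaner {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text with every code point outside the XML 1.0 range removed.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt yields the bare surrogate value for an unpaired surrogate,
            // which falls outside the allowed ranges and is therefore dropped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

This only removes code points that are illegal in XML. Characters that are valid but wrongly decoded, such as the ones in the CMAP error message above, would pass through unchanged, so it is a workaround for the symptom rather than a fix for the CMAP lookup.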