[TIKA-1174] Invalid characters in filtered PDF output - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: parser
Labels: None
Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)

Description

The PDF document at http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf produces invalid characters in the output when filtered by Tika 1.4.

> /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
ERROR - Error: Could not parse predefined CMAP file for 'nullžf °-ˇžl,¡ì$1-UCS2'
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>


[snip]

<p>Cycle network
</p>
<p>
</p>
<p>HILEY

</p>

Is there any proper way to avoid this, or is the best approach to strip such characters from Tika's output?
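If stripping turns out to be the only option, one possible workaround is a post-filter over the extracted text. The following is a minimal sketch, assuming the unwanted characters fall outside the range permitted by the XML 1.0 `Char` production (this has not been confirmed for this PDF). `XmlCharFilter` and its methods are hypothetical names for illustration, not part of Tika.

```java
// Hypothetical helper, not part of Tika: drops code points that XML 1.0 does not allow.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, skipping control characters, unpaired surrogates, U+FFFE and U+FFFF. */
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired surrogate,
            // which then fails the validity check and is skipped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling `XmlCharFilter.stripInvalidXmlChars(...)` on the text Tika returns would remove such characters before the output is indexed or serialized. It would not address the CMAP parse error itself, and characters that are valid XML but simply mis-decoded would pass through unchanged.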

Attachments

map_sp_1c_a4.pdf
20/Sep/13 05:31
228 kB
Matt Sheppard

People

Assignee: Unassigned

Reporter: Matt Sheppard

Votes: 0

Watchers: 3

Dates

Created: 20/Sep/13 05:30

Updated: 15/Mar/15 21:01