[TIKA-911] Converted PDF document contains question marks in place of spaces and inconsistent case - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.8
Fix Version/s: None
Component/s: parser
Labels: None

Description

The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with Tika v1.1 using

$ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf

produces substantially worse output than xpdf's pdftotext program.

Specifically, we see...

Some 'spaces' replaced with question marks

...
<body><div class="page"><p/>
<p>How can I help?
When you're overseas:
• ?wherever?possible,?don't?visit?crops?—?contact?with?
</p>
<p>growing?crops?greatly?increases?the?risk?of?contaminating?
footwear?or?clothing;?
...

and some odd case conversions:

<p>stem rust in wheat.  
 (soURce: BRAd collIs)</p>
<p/>
</div>

(The original document seems to contain "SOURCE: BRAD COLLIS" entirely in upper case.)
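
One way to probe where the question marks come from is to check whether the word separators are characters that the output encoding cannot represent. The Java sketch below is only an illustration under an unconfirmed assumption: that the separators are extracted as something other than a plain space, for example U+00A0 (no-break space), and are later encoded into a narrow charset such as US-ASCII. Neither that assumption nor the names `EncodingRoundTrip` and `roundTrip` come from this report or from Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {

    // Encodes the text in the given charset and decodes it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // (a question mark for US-ASCII) for characters it cannot map.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: no-break spaces (U+00A0) between the words.
        String sample = "wherever\u00A0possible,\u00A0don't\u00A0visit\u00A0crops";

        // The no-break spaces cannot be mapped to US-ASCII.
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));

        // UTF-8 can represent every character, so the text survives unchanged.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));
    }
}
```

If question marks appear only under the narrower charset, the console or output encoding may be worth ruling out before looking at the PDF parser itself.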

To compare that with pdftotext:

$ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf

This does not output the question marks, and it produces "Source: BRAD COLLIS" at the end, both of which seem to be improvements. Note, however, that it also produces a number of ^G characters, which are not desirable.
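
In caret notation, ^G is the BEL control character (U+0007). As a minimal sketch of a post-processing step that removes it from extracted text, assuming the text has already been read into a string, something like the following could be used. The class and method names (`BellStripper`, `stripBell`) are hypothetical and are not part of xpdf or Tika.

```java
public class BellStripper {

    // Removes every BEL control character (U+0007, displayed as ^G)
    // and leaves all other characters, including newlines, untouched.
    static String stripBell(String text) {
        return text.replace("\u0007", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample with two embedded BEL characters.
        String extracted = "Source:\u0007 BRAD COLLIS\u0007\n";
        System.out.print(stripBell(extracted));
    }
}
```

This only hides the symptom in the output; it does not explain why the characters are emitted in the first place.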

Attachments

Rust Biosecurity Brochure.pdf (738 kB), 02/May/12 04:37, Matt Sheppard
Rust Biosecurity Brochure.pdf.html (6 kB), 02/May/12 12:15, Matt Sheppard

People

Assignee: Unassigned

Reporter: Matt Sheppard

Votes: 0

Watchers: 1

Dates

Created: 02/May/12 04:36

Updated: 02/Mar/15 04:26