[TIKA-584] Tika parse of some PDF files removes all spaces between words - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.8
Fix Version/s: None
Component/s: None
Labels: None
Environment: Windows XP 3, openSUSE 11.2

Description

For some PDF files (not all), the content read from the Reader returned by Tika.parse(InputStream) has all spaces between words removed. One example is JavaEE6Tutorial.pdf (available from Oracle), and many other files show the same bug. I have also tried the Tika 0.9 snapshot, and the bug remains.

When PDFTextStripper is used directly, the extracted content is correct: the spaces between words are retained.
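
One way to check the symptom described above without reading the output by eye is to measure how much of the extracted text is whitespace. The following is only a minimal sketch: `WhitespaceCheck`, its method names, and the threshold parameter are hypothetical helpers, not part of Tika or PDFBox, and a suitable threshold would have to be chosen per corpus.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper (not part of Tika or PDFBox): measures how much of the
// extracted text is whitespace, to flag output whose word spacing was lost.
final class WhitespaceCheck {

    private WhitespaceCheck() {}

    // Returns the fraction of characters read from the reader that are
    // whitespace, or 0.0 if the reader yields no characters.
    static double whitespaceRatio(Reader reader) {
        long total = 0;
        long whitespace = 0;
        char[] buffer = new char[8192];
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (Character.isWhitespace(buffer[i])) {
                        whitespace++;
                    }
                }
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total == 0 ? 0.0 : (double) whitespace / total;
    }

    // minRatio is an assumed, caller-chosen threshold: text whose whitespace
    // fraction falls below it is treated as having lost its word spacing.
    static boolean looksLikeSpacesWereDropped(Reader reader, double minRatio) {
        return whitespaceRatio(reader) < minRatio;
    }
}
```

To compare the two extraction paths from this report, the Reader returned by Tika.parse(InputStream) could be passed in directly, and the String produced by PDFTextStripper could be wrapped in a java.io.StringReader. If the report is accurate, the first ratio should be much lower than the second for an affected file such as JavaEE6Tutorial.pdf.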

Attachments

JavaEE6Tutorial.pdf (5.30 MB), 17/Jan/11 03:43, Ajay Vohra

Issue Links

duplicates TIKA-548 PDF content extracted as single line (Closed)

People

Assignee: Jukka Zitting
Reporter: Ajay Vohra
Votes: 0
Watchers: 0

Dates

Created: 15/Jan/11 13:09
Updated: 19/Jan/11 12:58
Resolved: 19/Jan/11 12:58