[TIKA-548] PDF content extracted as single line - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8
Fix Version/s: 0.9
Component/s: parser
Labels: None

Description

Rev 1029510 introduces a regression in PDF content parsing that is now present in the 0.8 RC: paragraphs from the PDF are no longer separated by newlines. This is a problem both for reading and for indexing. See the attached test.

Note that PDFBox 1.3.1 seems to extract the text correctly, at least from the command line. Here is the output for a sample file with a headline followed by a one-word paragraph:
$> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson
PDF Title For Short Document
veryshortpdfcontents

But Tika prints:
$> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
...
<p>1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF
Title For Short Documentveryshortpdfcontents</p>
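
The difference between the two outputs above can be expressed as a small check. This is only a sketch, not the attached test patch: the class `ParagraphCheck` and its method `eachOnOwnLine` are hypothetical names, and the two strings are the console outputs quoted in this report with the `<p>` tags stripped. It uses only the Java standard library, so it does not call Tika or PDFBox itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Tika or PDFBox.
public class ParagraphCheck {

    /** Returns true if every expected paragraph appears as a line of its own. */
    public static boolean eachOnOwnLine(String extracted, String... paragraphs) {
        // Split on any line break and ignore surrounding whitespace.
        List<String> lines = Arrays.stream(extracted.split("\\R"))
                .map(String::trim)
                .collect(Collectors.toList());
        for (String paragraph : paragraphs) {
            if (!lines.contains(paragraph)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Text as printed by PDFBox 1.3.1 in this report.
        String pdfbox = "1 - untitled 3 - 2010-02-13 09:52 - Staffan Olsson\n"
                + "PDF Title For Short Document\n"
                + "veryshortpdfcontents\n";
        // Text as printed by the Tika snapshot in this report (tags removed).
        String tika = "1 - untitled 3 - 2010-02-13 09:52 - Staffan OlssonPDF\n"
                + "Title For Short Documentveryshortpdfcontents";

        String headline = "PDF Title For Short Document";
        String body = "veryshortpdfcontents";

        System.out.println(eachOnOwnLine(pdfbox, headline, body)); // true
        System.out.println(eachOnOwnLine(tika, headline, body));   // false
    }
}
```

A real regression test would presumably obtain the text from the parser rather than from hard-coded strings; the check itself would stay the same.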

Attachments

tika-PDF-content-regression-test.patch (1 kB), 11/Nov/10 19:39, Staffan Olsson
test.pdf (37 kB), 28/Nov/10 21:29, Reinhard Pötz

Issue Links

is duplicated by

TIKA-583 Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file [Resolved]
TIKA-584 Tika parse of some PDF files removes all spaces between words [Resolved]

People

Assignee: Jukka Zitting
Reporter: Staffan Olsson
Votes: 0
Watchers: 0

Dates

Created: 11/Nov/10 19:38
Updated: 02/Aug/12 09:33
Resolved: 18/Nov/10 18:11