[TIKA-796] Tika breaks words of rotated text in PDF documents - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 0.10, 1.0
Fix Version/s: None
Component/s: parser
Labels:
- broken
- linefeed
- pdf
- rotated
- text
- words
Environment:

Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

Description

When Tika extracts text from a PDF file, rotated text is extracted with its words broken apart. The number of lines in a rotated paragraph appears to equal the number of characters after which Tika splits the words with a line feed (0x0a) character.

Steps to reproduce this issue (in this example, on a Windows machine):

1. Download the following PDF file: http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf, e.g. to C:\temp\ (the command in the next step assumes it was saved as energieberatung.pdf).
2. Open a console window and run Tika with: java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt
3. Inspect the text file, e.g. with a hex editor, and note that the words are broken into two-character pieces: <char1><char2><LF>
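
As an alternative to a hex editor, the broken pattern can be flagged with a short script. The following is a minimal sketch, not part of Tika: the helper name find_fragment_runs and the thresholds max_len and min_run are assumptions chosen for illustration, and the sample string only imitates the <char1><char2><LF> output described above.

```python
# Hypothetical helper (not part of Tika): finds runs of very short lines,
# which is how the broken rotated text shows up in the extracted output.
def find_fragment_runs(text, max_len=2, min_run=3):
    """Return (start_line, fragments) for every run of at least min_run
    consecutive non-empty lines that are at most max_len characters long."""
    runs = []
    current = []
    start = 0
    for i, line in enumerate(text.split("\n")):
        stripped = line.strip()  # also drops a trailing CR, if present
        if 0 < len(stripped) <= max_len:
            if not current:
                start = i
            current.append(stripped)
        else:
            if len(current) >= min_run:
                runs.append((start, current))
            current = []
    if len(current) >= min_run:
        runs.append((start, current))
    return runs


# Made-up sample imitating the reported output; real input would be the
# contents of test.txt.
sample = "Normal line of text\nEn\ner\ngi\neb\ner\nat\nun\ng\nAnother normal line\n"
runs = find_fragment_runs(sample)
for start, fragments in runs:
    # Line index (0-based) where the run starts, and the rejoined word.
    print(start, "".join(fragments))
```

To check a real extraction, the contents of test.txt would be read and passed to the function; the encoding of that file depends on the console used for the redirect, so it may need to be specified explicitly when opening it.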

This problem seems to have been introduced in Tika 0.10; it still exists in Tika 1.0.

People

Assignee: Unassigned

Reporter: Franz Canaval

Votes: 0

Watchers: 0

Dates

Created: 01/Dec/11 11:20

Updated: 20/Jan/12 08:54

Resolved: 20/Jan/12 08:54