- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 0.10, 1.0
- Fix Version/s: None
- Component/s: parser
- Environment: Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
When Tika extracts text from a PDF file, rotated text comes out with its words broken apart. The number of characters after which Tika inserts a line feed (0x0a) appears to equal the number of lines in the rotated paragraph.
Steps to reproduce this issue (in this example, on a Windows machine):
- Download the following PDF file: http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf and save it, e.g. as C:\temp\energieberatung.pdf
- Open a console window and run Tika with: java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt
- Inspect the text file, e.g. with a hex editor, and note the words broken into two-character pieces: <char1><char2><LF>
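As an alternative to inspecting test.txt in a hex editor, the symptom can be checked programmatically. The following is only a rough sketch using the Java standard library: the class name BrokenWordCheck and the method longestShortLineRun are hypothetical helpers, not part of Tika, and the threshold of 2 characters is taken from the pattern seen in this example. It also assumes Java 7 or later for java.nio.file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: looks for the
// <char1><char2><LF> pattern described in this report.
public class BrokenWordCheck {

    /**
     * Returns the length of the longest run of consecutive non-empty
     * lines that contain at most maxLen characters (after trimming,
     * so a trailing CR from Windows line endings is ignored).
     */
    static int longestShortLineRun(String text, int maxLen) {
        int best = 0;
        int current = 0;
        for (String line : text.split("\n", -1)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && trimmed.length() <= maxLen) {
                current++;
                best = Math.max(best, current);
            } else {
                current = 0; // a normal or empty line ends the run
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the first argument is the path to Tika's text output.
        // Malformed bytes are replaced rather than rejected, which is
        // good enough for counting line lengths.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, StandardCharsets.UTF_8);
        int run = longestShortLineRun(text, 2);
        System.out.println("Longest run of lines with <= 2 characters: " + run);
    }
}
```

Run it with the path to test.txt as the first argument. A long run of very short lines suggests the broken-word output described here, whereas normally extracted text should yield a run of 0 or close to it.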
This problem seems to have been introduced with Tika 0.10 and still exists in Tika 1.0.