Details
Description
When Tika extracts text from a PDF file, rotated texts are extracted in a way that words are broken. Apparently the number of lines of a rotated paragraph seems to be the number of characters after which Tika breaks the words apart with a line feed (0x0a) character.
Steps to reproduce this issue (in this example, on a Windows machine):
- Download the following pdf file: http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf, e.g. to C:\temp\
- Open a console window and run tika with: java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt
- Have a look at the text file, e.g. with a hex editor and note the words broken in 2-character-pieces: <char1><char2><LF>
This problems seems to be introduced with Tika 0.10, it still exists with Tika 1.0.