TIKA-796

Tika breaks words of rotated text in PDF documents

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.10, 1.0
    • Fix Version/s: None
    • Component/s: parser
    • Environment: Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

      Description

      When Tika extracts text from a PDF file, rotated text comes out with its words broken apart. The number of lines in a rotated paragraph appears to equal the number of characters after which Tika splits the words with a line feed (0x0a) character.

      Steps to reproduce this issue (in this example, on a Windows machine):

      • Download the following PDF file: http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf, and save it, e.g., as C:\temp\energieberatung.pdf
      • Open a console window and run Tika with: java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt
      • Inspect the text file, e.g. with a hex editor, and note the words broken into two-character pieces: <char1><char2><LF>
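
      As an alternative to inspecting the file in a hex editor, the broken words can be located with a short script. The following is only a sketch, not part of Tika: the function names, the thresholds (max_len, min_run) and the assumption that the output decodes as UTF-8 are all illustrative choices. It reports runs of consecutive very short lines, which is the pattern described in the last step.

```python
def find_fragment_runs(data: bytes, max_len: int = 3, min_run: int = 4):
    """Return (start_line, fragments) for every run of at least `min_run`
    consecutive non-empty lines that are at most `max_len` characters long.

    The thresholds are assumptions chosen for this illustration; adjust
    them to the fragment size seen in the actual output.
    """
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    text = data.decode("utf-8", errors="replace")
    runs = []
    current = []
    start = 0
    for number, line in enumerate(text.split("\n"), start=1):
        fragment = line.rstrip("\r")  # tolerate CRLF as well as bare LF
        if 0 < len(fragment) <= max_len:
            if not current:
                start = number
            current.append(fragment)
        else:
            if len(current) >= min_run:
                runs.append((start, current))
            current = []
    if len(current) >= min_run:
        runs.append((start, current))
    return runs


def report_file(path: str) -> None:
    """Print each suspicious run found in the file at `path`."""
    with open(path, "rb") as handle:
        data = handle.read()
    for start, fragments in find_fragment_runs(data):
        print(f"line {start}: {len(fragments)} fragments, "
              f"joined: {''.join(fragments)!r}")
```

      Calling report_file("test.txt") on the output of the command above should list the affected passages, if any. Short lines also occur legitimately (for example in tables), so hits are candidates to check rather than proof of the bug.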

      This problem seems to have been introduced with Tika 0.10; it still exists in Tika 1.0.

            People

            • Assignee: Unassigned
            • Reporter: cfcf Franz Canaval