TIKA-796: Tika breaks words of rotated text in PDF documents

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.10, 1.0
    • Fix Version/s: None
    • Component/s: parser
    • Environment: Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)

    Description

      When Tika extracts text from a PDF file, rotated text comes out with its words broken apart. The number of lines in a rotated paragraph appears to equal the number of characters after which Tika splits the words with a line feed (0x0a) character.

      Steps to reproduce this issue (in this example, on a Windows machine):

      • Download the following PDF file: http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf and save it, e.g. as C:\temp\energieberatung.pdf
      • Open a console window and run Tika with: java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt
      • Have a look at the text file, e.g. with a hex editor, and note the words broken into 2-character pieces: <char1><char2><LF>
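
      As an alternative to inspecting test.txt in a hex editor, the symptom can be checked with a small program. The following is only a minimal sketch: the class name BrokenWordCheck, the helper countShortLines, and the threshold of 2 characters are illustrative assumptions, not part of Tika. It also assumes Java 7 or later and that test.txt was written in the platform default encoding.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: counts very short lines in the
// extracted text, which is how the broken rotated words show up.
public class BrokenWordCheck {

    // Returns the number of non-blank lines whose trimmed length is at most maxLen.
    static int countShortLines(String text, int maxLen) {
        int count = 0;
        for (String line : text.split("\n")) {
            String trimmed = line.trim(); // also drops a trailing '\r' from CRLF files
            if (!trimmed.isEmpty() && trimmed.length() <= maxLen) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the output of the reproduction step was saved as test.txt.
        String path = args.length > 0 ? args[0] : "test.txt";
        // Assumption: the file uses the platform default encoding.
        String text = new String(Files.readAllBytes(Paths.get(path)), Charset.defaultCharset());
        System.out.println(countShortLines(text, 2)
                + " non-blank lines of at most 2 characters in " + path);
    }
}
```

      A large count relative to the size of the document suggests the fragmentation described above. A few short lines can also occur in correctly extracted text, so the number is a hint rather than proof.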

      This problem seems to have been introduced with Tika 0.10; it still exists in Tika 1.0.

          People

            Assignee: Unassigned
            Reporter: Franz Canaval
            Votes: 0
            Watchers: 0
