Tika / TIKA-4277

PDF parse issue for rotated text

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0-BETA, 2.9.2
    • Fix Version/s: None
    • Component/s: tika-app, tika-server
    • Important

    Description

      Tika and Tika Server 2.9.2 and 3.0.0-BETA produce an incorrect result when parsing rotated text.

      The attached PDF is not parsed correctly by Tika 2.9.2 or 3.0.0-BETA, in either the server version or the standalone version.

      If the text is rotated 90 degrees, the parsed result has a line break after each character of a word. This happens with symbols, English letters, and CJK characters.

      In the server version:
      curl -g -T "sample2.pdf" http://localhost:889/tika --header "Accept: text/plain"
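
For scripted reproduction, the curl call above can be sketched with Python's standard library. This is only an illustration: the URL and port are copied from the report, `build_tika_request` and `extract_text` are hypothetical helper names, and the sketch assumes a Tika Server is actually listening at that address.

```python
import urllib.request

TIKA_URL = "http://localhost:889/tika"  # endpoint exactly as given in the report


def build_tika_request(pdf_bytes: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build the PUT request that `curl -T` would send (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(pdf_path: str, url: str = TIKA_URL) -> str:
    """Upload a PDF and return the plain-text body of the response."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("sample2.pdf")` against a running server should return the same text that the curl command prints.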

      In the standalone version:
      java.exe -jar "C:\TikaSearch\tika-app-2.9.2.jar" --text

      Both of the above produce the incorrect result for the attached PDF.

      The output is shown below:

      i
      n
      s
      e
      r
      t
       
      t
      e
      x
      t
       
      p
      r
      o
      b
      l
      e
      m

      insert text problem
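
The symptom above (one character per line) can be recognized automatically, for example in a regression check. The following is a minimal sketch, not part of Tika: `is_split_per_character` and `rejoin_characters` are hypothetical helpers, the 0.9 threshold is an arbitrary assumption, and the sketch assumes that a space between words appears as a line holding a single space, as in the output above.

```python
def is_split_per_character(text: str, threshold: float = 0.9) -> bool:
    """Return True if most non-empty lines hold exactly one character.

    The threshold is an assumed cut-off, not a value taken from the report.
    """
    lines = [line for line in text.split("\n") if line != ""]
    if not lines:
        return False
    single = sum(1 for line in lines if len(line) == 1)
    return single / len(lines) >= threshold


def rejoin_characters(text: str) -> str:
    """Concatenate all single-character lines; longer lines are ignored."""
    return "".join(line for line in text.split("\n") if len(line) == 1)
```

This does not fix the parser; it only flags output that shows the reported pattern and reconstructs the text for comparison.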

      Attachments

        1. sample2.pdf
          11 kB
          ragebear
        2. OtherPDFReader.png
          326 kB
          ragebear

            People

              Assignee: Unassigned
              Reporter: ragebear
              Votes: 0
              Watchers: 2
