Tika / TIKA-1597

RTF with embedded image parsing produces div before html


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7
    • Fix Version/s: 1.8
    • Component/s: None
    • Labels: None
    • Environment: linux, oracle jdk 7u75

    Description

      On tika-1.8-rc1.

      java -jar tika-app/target/tika-app-1.8.jar -x 2.rtf returns

      <?xml version="1.0" encoding="UTF-8"?><div xmlns="http://www.w3.org/1999/xhtml">HOHcvanAHTI'Imoc
      v8 Hanemnan npfiBOBafi "DRAW
      
      </div>
      <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
      <!-- tail omitted -->
      

      Removing the image prevents this behavior (3.rtf doesn't contain an embedded image).

      Update: Tesseract must be installed to reproduce this issue.
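
The symptom above (a div serialized ahead of the html root) can be checked mechanically. The following is a minimal sketch, not part of Tika: the helper names `first_element_name` and `has_content_before_html` are hypothetical, and it assumes the `-x` output has already been captured as a string and uses the default (unprefixed) XHTML namespace, as in the sample output. It uses a regular expression rather than an XML parser so that it also works on truncated output.

```python
import re
from typing import Optional


def first_element_name(xhtml: str) -> Optional[str]:
    """Return the name of the first element in the serialized output.

    Skips an optional XML declaration, leading comments and whitespace.
    Hypothetical helper for illustration; not a Tika API.
    """
    # Drop the XML declaration if present.
    rest = re.sub(r"^\s*<\?xml[^>]*\?>", "", xhtml, count=1)
    # Drop leading comments and whitespace.
    rest = re.sub(r"^(\s*<!--.*?-->)*\s*", "", rest, count=1, flags=re.S)
    match = re.match(r"<([A-Za-z_][\w.\-]*)", rest)
    return match.group(1) if match else None


def has_content_before_html(xhtml: str) -> bool:
    """True when the output does not start with the html root element."""
    return first_element_name(xhtml) != "html"


# Shortened stand-in for the output shown in this report.
broken = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<div xmlns="http://www.w3.org/1999/xhtml">ocr text\n</div>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n'
)
# What a correct run is assumed to look like: html is the first element.
expected = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head/></html>'
)

print(has_content_before_html(broken))
print(has_content_before_html(expected))
```

Running the same check against the output for 3.rtf would be one way to confirm that only the file with the embedded image is affected.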

      Attachments

        1. 2.rtf
          67 kB
          Konstantin Gribov
        2. 3.rtf
          2 kB
          Konstantin Gribov

        Activity

          People

            Assignee: Unassigned
            Reporter: Konstantin Gribov (grossws)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: