  1. Tika
  2. TIKA-2897

Invalid XHTML output for some OpenOffice files (created in LibreOffice Impress)

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.21
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None
    • Command line to reproduce:

      java -jar tika-app.jar --xml Impress.odp

    Description

      The XHTML output produced by Tika 1.21 is invalid for some LibreOffice documents. A sample document (created in LibreOffice Impress 6.1.5) is attached.

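      Here is a sample of the output. Each <p class="notes"> tag is never closed and is instead followed by a stray </notes> closing tag, so the result is not well-formed and any XML/XHTML parser will fail to parse it: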

      <p class="notes"><div/>
      </notes><div><p>SECOND PAGE</p>
      </div>
      <div><ul> <li><p>Text on the second page</p>
      </li>
      </ul>
      </div>
      <p class="notes"><div/>
      </notes></body></html>
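One way to confirm the problem independently of Tika is to run a fragment of the output through a strict XML parser. The following is a minimal sketch in Python using only the standard library; the helper name `is_well_formed` and the two sample strings are illustrative assumptions (the first is modelled on the output quoted above, the second is one hypothetical repaired form), not part of Tika.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Fragment modelled on the reported output: <p> is opened, </notes> is closed.
broken = '<body><p class="notes"><div/>\n</notes></body>'

# Hypothetical repaired form (assumption): the opening tag is closed by </p>.
fixed = '<body><p class="notes"><div/>\n</p></body>'

print(is_well_formed(broken))  # False: mismatched closing tag
print(is_well_formed(fixed))   # True
```

This checks well-formedness only, not validity against the XHTML DTD. To test real output, the result of the command line given above could be redirected to a file and its contents passed to the same helper.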

       

      Thanks!

      Attachments

        1. Impress.odp
          312 kB
          Funbit

          People

            Assignee: Unassigned
            Reporter: Funbit
            Votes: 0
            Watchers: 1