Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.21
None
None
Command line to reproduce:
java -jar tika-app.jar --xml Impress.odp
Description
The XHTML output produced by Tika 1.21 is invalid for some LibreOffice documents. A sample document (created in LibreOffice 6.1.5) is attached.
Here is a sample of the output. The <p class="notes"> element is never closed and is followed by an unmatched </notes> end tag, so any XHTML parser will fail to parse it:
<p class="notes"><div/>
</notes><div><p>SECOND PAGE</p>
</div>
<div><ul> <li><p>Text on the second page</p>
</li>
</ul>
</div>
<p class="notes"><div/>
</notes></body></html>
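To confirm the problem independently of any particular consumer, the output can be run through a plain XML well-formedness check. The following is a minimal sketch using only the JDK's SAX parser. The class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for this illustration and are not part of Tika, and the two inline fragments are shortened versions of the output above rather than the full document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for this report; not part of Tika.
class WellFormedCheck {

    // Returns true if the given text parses as well-formed XML, false otherwise.
    public static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            // DefaultHandler rethrows fatal errors, so malformed input
            // surfaces as a SAXException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("XML parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        // Shortened fragment mirroring the reported output:
        // <p> is opened, then a </notes> end tag appears.
        String reported = "<body><p class=\"notes\"><div/>\n</notes></body>";
        // The same fragment with the <p> element closed properly.
        String balanced = "<body><p class=\"notes\"><div/>\n</p></body>";

        System.out.println("reported fragment well-formed: " + isWellFormed(reported));
        System.out.println("balanced fragment well-formed: " + isWellFormed(balanced));
    }
}
```

The check only tests XML well-formedness, not validity against the XHTML schema, which is enough to show the mismatched tags. The same method could be applied to the full output of the command above after saving it to a file and reading it into a string.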
Thanks!