Details
Bug
Status: Open
Minor
Resolution: Unresolved
1.13
None
None
org.apache.tika.parser.microsoft.OfficeParser and org.apache.tika.parser.microsoft.ooxml.OOXMLParser
Description
I'm converting Excel files, both .xls and .xlsx.
.xls files are handled by org.apache.tika.parser.microsoft.OfficeParser and
.xlsx files by org.apache.tika.parser.microsoft.ooxml.OOXMLParser.
If the Excel document contains a link, for example santa@gmail.com, the .xls parser adds extra elements to the output that are not part of the document's structure, so the output misrepresents how the document looks.
For example, this table in file.xls:
mailadress       password
santa@gmail.com  hohoho
produces the following output:
<div class="page">
<h1>Sheet1</h1>
<table>
<tbody>
<tr>
<td>mailadress</td>
<td>password</td>
</tr>
<tr>
<td>santa@gmail.com</td>
<td>hohoho</td>
</tr>
</tbody>
</table>
<div class="outside">
<a href="mailto:santa@gmail.com">santa@gmail.com</a>
</div>
</div>
The <div class="outside"> should be removed because it does not correspond to the document structure.
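As a stopgap on the caller's side, the extra element could be stripped after parsing. The following is only a minimal sketch, assuming the parser output is available as a well-formed XHTML string. The class name OutsideDivStripper and the method stripOutsideDivs are hypothetical names chosen for illustration and are not part of Tika; only the JDK's built-in DOM and transformer APIs are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: post-processes XHTML output.
public class OutsideDivStripper {

    /** Removes every <div class="outside"> element, including its children. */
    public static String stripOutsideDivs(String xhtml) {
        try {
            // Assumption: the input is well-formed XML with a single root element.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Collect matches first: the NodeList is live and would shift
            // under us if we removed nodes while iterating over it.
            NodeList divs = doc.getElementsByTagName("div");
            List<Element> toRemove = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if ("outside".equals(div.getAttribute("class"))) {
                    toRemove.add(div);
                }
            }
            for (Element div : toRemove) {
                div.getParentNode().removeChild(div);
            }

            // Serialize the modified tree back to a string without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XHTML input", e);
        }
    }
}
```

Calling OutsideDivStripper.stripOutsideDivs on the output shown above would keep the table and drop the trailing link block. This only hides the symptom for one consumer; the .xls and .xlsx parsers would still produce different output for the same content.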