Description
Word documents created with third-party programs such as docx4j sometimes contain embedded HTML, which Tika does not parse. The embedded HTML file usually resides in the main folder of the docx internal structure.
Changing org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart() as follows handles this case (the new branch is the final else if):
if (POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type) && TYPE_OLE_OBJECT.equals(target.getContentType())) {
    handleEmbeddedOLE(target, xhtml, sourceDesc + rel.getId(), parentMetadata);
    if (targetURI != null) {
        handledTarget.add(targetURI.toString());
    }
} else if (RELATION_MEDIA.equals(type) || RELATION_VIDEO.equals(type) || RELATION_AUDIO.equals(type)
        || PackageRelationshipTypes.IMAGE_PART.equals(type) || POIXMLDocument.PACK_OBJECT_REL_TYPE.equals(type)
        || POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)) {
    handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
    if (targetURI != null) {
        handledTarget.add(targetURI.toString());
    }
} else if (XSSFRelation.VBA_MACROS.getRelation().equals(type)) {
    handleMacros(target, xhtml);
    if (targetURI != null) {
        handledTarget.add(targetURI.toString());
    }
} else if (type.endsWith("aFChunk")) {
    // New branch: relationship types ending in "aFChunk" (e.g. embedded HTML) are passed on as embedded files
    handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
}
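To check whether a given document is affected, it may help to list the relationships whose type ends in "aFChunk", which is the same test the new branch applies. The following is a minimal sketch that uses only the JDK, not Tika or POI. The class name AltChunkFinder, the sample target /chunk.html, and the assumption that the relationships live in word/_rels/document.xml.rels are illustrative and not taken from the report.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists relationship targets of type "...aFChunk".
public class AltChunkFinder {

    /** Returns the Target of every Relationship element whose Type ends with "aFChunk". */
    static List<String> findAltChunkTargets(InputStream relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        NodeList rels = factory.newDocumentBuilder().parse(relsXml)
                .getElementsByTagNameNS("*", "Relationship");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            // Same suffix test as the proposed branch in handleEmbeddedPart()
            if (rel.getAttribute("Type").endsWith("aFChunk")) {
                targets.add(rel.getAttribute("Target"));
            }
        }
        return targets;
    }

    /** Reads the main document's relationship part from a docx (assumed location). */
    static List<String> findAltChunkTargets(Path docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/_rels/document.xml.rels");
            if (entry == null) {
                return Collections.emptyList();
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return findAltChunkTargets(in);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            System.out.println(findAltChunkTargets(Path.of(args[0])));
            return;
        }
        // Illustrative relationship part; the target name is made up for this sketch.
        String sample =
            "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
          + "<Relationship Id=\"rId1\" Target=\"styles.xml\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles\"/>"
          + "<Relationship Id=\"rId2\" Target=\"/chunk.html\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk\"/>"
          + "</Relationships>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(findAltChunkTargets(in)); // prints [/chunk.html]
    }
}
```

Run without arguments, it parses the built-in sample; given a path to a docx file, it reports the targets found in that file. An empty list suggests the document does not use this kind of embedding, at least in the main document part.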