[TIKA-4124] embedded html of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk is not parsed - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.9.1
Component/s: parser
Labels: None

Description

Word documents created with third-party tools such as docx4j sometimes contain embedded HTML, referenced through a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk. Tika does not parse this embedded content. The embedded HTML file usually resides in the main folder of the docx internal structure.
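
To make the affected relationship concrete, here is a minimal sketch that scans a relationships part (the XML in a docx `.rels` file) for entries of the aFChunk type named in the issue title. It uses only the JDK and is not Tika or POI code. The class name, method name, and sample XML (including the target `/chunk.html`) are assumptions made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Tika or POI.
class AltChunkRelsSketch {

    // Relationship type taken from the issue title.
    static final String AF_CHUNK_REL_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    // Returns the Target attribute of every Relationship whose Type is aFChunk.
    static List<String> findAltChunkTargets(String relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(relsXml.getBytes(StandardCharsets.UTF_8)));
            // "*" matches Relationship elements in any namespace.
            NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                if (AF_CHUNK_REL_TYPE.equals(rel.getAttribute("Type"))) {
                    targets.add(rel.getAttribute("Target"));
                }
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse relationships XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; a real document's ids and targets will differ.
        String sample = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
                + "<Relationship Id=\"rId1\" Target=\"styles.xml\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles\"/>"
                + "<Relationship Id=\"rId2\" Target=\"/chunk.html\" Type=\"" + AF_CHUNK_REL_TYPE + "\"/>"
                + "</Relationships>";
        System.out.println(findAltChunkTargets(sample));
    }
}
```

A check like this can help confirm whether a given document actually contains the kind of embedded part described here before debugging the parser itself.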

Changing org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart() as follows handles this case. The final else if branch is the addition:

    if (POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)
            && TYPE_OLE_OBJECT.equals(target.getContentType())) {
        handleEmbeddedOLE(target, xhtml, sourceDesc + rel.getId(), parentMetadata);
        if (targetURI != null) {
            handledTarget.add(targetURI.toString());
        }
    } else if (RELATION_MEDIA.equals(type) || RELATION_VIDEO.equals(type)
            || RELATION_AUDIO.equals(type)
            || PackageRelationshipTypes.IMAGE_PART.equals(type)
            || POIXMLDocument.PACK_OBJECT_REL_TYPE.equals(type)
            || POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)) {
        handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
        if (targetURI != null) {
            handledTarget.add(targetURI.toString());
        }
    } else if (XSSFRelation.VBA_MACROS.getRelation().equals(type)) {
        handleMacros(target, xhtml);
        if (targetURI != null) {
            handledTarget.add(targetURI.toString());
        }
    } else if (type.endsWith("aFChunk")) {
        handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
    }
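
The proposed branch matches the relationship type by suffix. As a minimal sketch of what that check accepts, the following compares the suffix test with an exact comparison against the full URI from the issue title. The class and method names are hypothetical, the example `urn:` type is made up, and this is not the code that was merged into Tika.

```java
// Hypothetical illustration of the two matching strategies; not Tika code.
class AltChunkTypeCheck {

    // Full relationship type, taken from the issue title.
    static final String AF_CHUNK_REL_TYPE =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    // Mirrors the check in the proposed patch: any type ending in "aFChunk".
    static boolean matchesBySuffix(String type) {
        return type != null && type.endsWith("aFChunk");
    }

    // Stricter alternative: only the exact relationship type URI.
    static boolean matchesExactly(String type) {
        return AF_CHUNK_REL_TYPE.equals(type);
    }

    public static void main(String[] args) {
        // A made-up type that happens to share the suffix.
        String lookalike = "urn:example:custom-aFChunk";
        System.out.println(matchesBySuffix(AF_CHUNK_REL_TYPE) + " " + matchesExactly(AF_CHUNK_REL_TYPE));
        System.out.println(matchesBySuffix(lookalike) + " " + matchesExactly(lookalike));
    }
}
```

The suffix test is more lenient and would also accept types from other namespaces that end in the same name. Which behavior is preferable depends on how tolerant the parser should be of non-standard producers.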

Issue Links

links to

GitHub Pull Request #1324

People

Assignee: Unassigned

Reporter: Tim Barrett

Votes: 0

Watchers: 4

Dates

Created: 04/Sep/23 08:38

Updated: 11/Sep/23 21:48

Resolved: 11/Sep/23 20:10