Tika / TIKA-4124

Embedded HTML of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk is not parsed

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.1
    • Component/s: parser
    • Labels: None

    Description

      Word documents created with third-party programs such as docx4j sometimes contain embedded HTML, referenced through a relationship of type http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk. Tika does not parse this embedded HTML. The embedded HTML file usually resides in the main folder of the docx internal structure.
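
To check whether a given docx is affected, one can list the aFChunk relationships of the main document part. The following is a minimal sketch using only the JDK. It assumes the main part's relationships live in word/_rels/document.xml.rels, which is the usual location but not guaranteed; the class and method names are hypothetical and are not part of Tika or POI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Tika or POI.
final class AltChunkRels {
    // Relationship type named in the issue title.
    static final String AF_CHUNK =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    // Assumption: the usual location of the main document's relationships.
    static final String MAIN_RELS = "word/_rels/document.xml.rels";

    // Returns the Target attribute of every aFChunk relationship in a .rels stream.
    static List<String> fromRelsXml(InputStream relsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid XXE on untrusted input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(relsXml);
            NodeList rels = doc.getElementsByTagNameNS("*", "Relationship");
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                if (AF_CHUNK.equals(rel.getAttribute("Type"))) {
                    targets.add(rel.getAttribute("Target"));
                }
            }
            return targets;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read relationships", e);
        }
    }

    // Opens a docx as a zip and inspects the main document's relationships.
    static List<String> fromDocx(String docxPath) {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(MAIN_RELS);
            if (entry == null) {
                return new ArrayList<>(); // no relationships part at the assumed path
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return fromRelsXml(in);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not open " + docxPath, e);
        }
    }
}
```

A non-empty result from AltChunkRels.fromDocx(path) suggests the document carries embedded content of the kind this issue describes.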

      Changing the code in org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedPart() as follows handles this case (see the final else if):

       
      if (POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type) && TYPE_OLE_OBJECT.equals(target.getContentType())) {
          handleEmbeddedOLE(target, xhtml, sourceDesc + rel.getId(), parentMetadata);
          if (targetURI != null) {
              handledTarget.add(targetURI.toString());
          }
      } else if (RELATION_MEDIA.equals(type) || RELATION_VIDEO.equals(type) || RELATION_AUDIO.equals(type)
              || PackageRelationshipTypes.IMAGE_PART.equals(type) || POIXMLDocument.PACK_OBJECT_REL_TYPE.equals(type)
              || POIXMLDocument.OLE_OBJECT_REL_TYPE.equals(type)) {
          handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
          if (targetURI != null) {
              handledTarget.add(targetURI.toString());
          }
      } else if (XSSFRelation.VBA_MACROS.getRelation().equals(type)) {
          handleMacros(target, xhtml);
          if (targetURI != null) {
              handledTarget.add(targetURI.toString());
          }
      } else if (type.endsWith("aFChunk")) {
          handleEmbeddedFile(target, xhtml, sourceDesc + rel.getId());
      }
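
Unlike the earlier branches, the proposed aFChunk branch matches on a suffix and does not record the target in handledTarget. A possible tightening is sketched below as a standalone class so it can be tried outside Tika. It assumes handledTarget is a set of target URI strings used to avoid processing the same part twice; the class and method names are hypothetical and this is not the fix that was committed.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical standalone illustration; not Tika's actual implementation.
final class AltChunkBranch {
    // Full relationship type from the issue title, matched exactly instead of by suffix.
    static final String AF_CHUNK =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    // Returns true when the part should be passed to the embedded-file handler.
    // Assumption: handledTarget holds the URI strings of parts already processed.
    static boolean shouldHandle(String type, URI targetURI, Set<String> handledTarget) {
        if (!AF_CHUNK.equals(type)) {
            return false; // some other relationship type
        }
        if (targetURI != null && !handledTarget.add(targetURI.toString())) {
            return false; // this target was already handled
        }
        return true;
    }
}
```

An exact match avoids picking up unrelated relationship types that merely end in "aFChunk", at the cost of missing variants of the type URI that a suffix match would catch.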

            People

              Assignee: Unassigned
              Reporter: Tim Barrett (comcortim)
              Votes: 0
              Watchers: 4
