Tika / TIKA-2192

Extract embedded files from headers, footers, footnotes, etc from docx/m

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels: None

      Description

      While working on an alternate SAX parser for docx/docm, I found that we're not currently extracting embedded documents from headers, footers, footnotes, endnotes or comments. We should fix this in our classic DOM parser.
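
      To illustrate where such embedded documents sit in a docx/docm package, here is a minimal sketch that uses only the JDK. It is not the fix itself (the commits below touch XWPFWordExtractorDecorator) and it is not Tika or POI API. It assumes the part names Word typically writes (word/header1.xml, word/footnotes.xml, and so on), so the relationship parts it inspects are word/_rels/<part>.xml.rels. The class name EmbeddedRelsSketch, the helper demoPackage, and the demo package contents are hypothetical. Requires Java 9+ for readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class EmbeddedRelsSketch {
    static final String REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String TYPE_BASE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";
    // Assumption: conventional Word part names; a producer is free to name parts differently.
    static final Pattern SECONDARY_RELS = Pattern.compile(
            "word/_rels/(header\\d*|footer\\d*|footnotes|endnotes|comments)\\.xml\\.rels");

    /** Lists embedded-object targets referenced from headers, footers, notes and comments. */
    public static List<String> findEmbeddedTargets(InputStream docx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since the package may come from an untrusted source.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        List<String> found = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!SECONDARY_RELS.matcher(e.getName()).matches()) {
                    continue; // skips the main document's relationships and all other parts
                }
                // Buffer the entry so the XML parser cannot close the zip stream.
                byte[] xml = zip.readAllBytes();
                NodeList rels = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml))
                        .getElementsByTagNameNS(REL_NS, "Relationship");
                for (int i = 0; i < rels.getLength(); i++) {
                    Element rel = (Element) rels.item(i);
                    String type = rel.getAttribute("Type");
                    // OLE objects and embedded OOXML packages are the two embedding types checked here.
                    if (type.equals(TYPE_BASE + "oleObject") || type.equals(TYPE_BASE + "package")) {
                        found.add(e.getName() + " -> " + rel.getAttribute("Target"));
                    }
                }
            }
        }
        return found;
    }

    /** Builds a tiny in-memory package (hypothetical contents) holding only relationship parts. */
    static byte[] demoPackage() throws Exception {
        String open = "<Relationships xmlns=\"" + REL_NS + "\">";
        String headerRels = open
                + "<Relationship Id=\"rId1\" Type=\"" + TYPE_BASE + "oleObject\" Target=\"embeddings/oleObject1.bin\"/>"
                + "<Relationship Id=\"rId2\" Type=\"" + TYPE_BASE + "image\" Target=\"media/image1.emf\"/>"
                + "</Relationships>";
        String mainRels = open
                + "<Relationship Id=\"rId1\" Type=\"" + TYPE_BASE + "package\" Target=\"embeddings/Workbook1.xlsx\"/>"
                + "</Relationships>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/_rels/document.xml.rels"));
            zip.write(mainRels.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/_rels/header1.xml.rels"));
            zip.write(headerRels.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        findEmbeddedTargets(new ByteArrayInputStream(demoPackage())).forEach(System.out::println);
    }
}
```

      In the demo package only the header's OLE object is reported. The workbook referenced from word/_rels/document.xml.rels is deliberately skipped, because the main document body is the part the existing parser already covered according to the description above.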

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1150 (See https://builds.apache.org/job/Tika-trunk/1150/)
        TIKA-2192 - add extraction of embedded objects in DOM docx parser from (tallison: rev 615bf75fc11e8fc299be550b8cd4bb24f45a264a)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java

        update changes for TIKA-2191 and TIKA-2192 (tallison: rev 5425d02a1ed97ce5f884a076f55ad8197cc6ac7b)

        • (edit) CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #194 (See https://builds.apache.org/job/tika-2.x/194/)
        TIKA-2192 (tallison: rev e02084cc64c5a825dae6e16853c5dac3cbb55f46)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java

          People

          • Assignee:
            Unassigned
            Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            1
            Watchers:
            2
