Tika / TIKA-2026

Handle OLE 2.0 embedded non-Office document in PPT/X and XLSX

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.14
    • Component/s: None
    • Labels: None

      Description

      When some files (e.g. PDFs) are embedded in XLSX, PPT and PPTX files, they are wrapped in an OLE CompObj container. In TIKA-704, we added handling for this kind of embedded file in DOC/DOCX files. We need to make a few modifications to extract them from XLSX, PPT and PPTX as well.

      1. oleObject1.bin
        38 kB
        Tim Allison
      2. testEmbedded3.pptx
        106 kB
        Tim Allison

        Activity

        tallison@mitre.org Tim Allison added a comment -

        container file with one embedded comp-obj

        tallison@mitre.org Tim Allison added a comment -

        This may be a duplicate issue...but I can't find the original quickly...

        tallison@mitre.org Tim Allison added a comment -

        These embedded OLE objects are handled correctly for docx files (TIKA-704).

        We're currently checking for:

                    if (root.hasEntry("CONTENTS")
                            && root.hasEntry("\u0001Ole")
                            && root.hasEntry("\u0001CompObj")
                            && root.hasEntry("\u0003ObjInfo")) {
        

        However, the OLE objects embedded in PPTX files don't appear to have an ObjInfo entry. The entries they do have are:

        OlePres000
        Ole
        CompObj
        CONTENTS
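
A minimal sketch of one way the check could be relaxed so that the entry list above still matches: require CONTENTS, Ole and CompObj, and treat ObjInfo as optional. The class and method names are hypothetical, and the entry names are passed in as a plain Set instead of a POI directory node so that the example runs on its own. Whether the committed fix took this exact form is not stated in this issue.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; this is not Tika's actual code.
final class CompObjWrapperCheck {

    // Stream names as used in the check quoted above. The Ole and CompObj
    // names carry a control-character prefix that the plain listing omits.
    static final String CONTENTS = "CONTENTS";
    static final String OLE = "\u0001Ole";
    static final String COMP_OBJ = "\u0001CompObj";

    // Returns true when the entry names look like a CompObj wrapper around
    // an embedded file. The ObjInfo stream is deliberately not required.
    static boolean looksLikeCompObjWrapper(Set<String> entryNames) {
        return entryNames.contains(CONTENTS)
                && entryNames.contains(OLE)
                && entryNames.contains(COMP_OBJ);
    }

    public static void main(String[] args) {
        // Entries as listed for the PPTX case; the prefix on OlePres000 is an
        // assumption and does not affect the result.
        Set<String> fromPptx = new HashSet<>(Arrays.asList(
                "\u0002OlePres000", "\u0001Ole", "\u0001CompObj", "CONTENTS"));
        System.out.println(looksLikeCompObjWrapper(fromPptx));
    }
}
```

A looser check like this could in principle match containers that the stricter DOCX check rejects, so it would need to be verified against the attached test files.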
        
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Tika-trunk #1074 (See https://builds.apache.org/job/Tika-trunk/1074/)
        TIKA-2026 – improve extraction of embedded files from ppt, pptx and (tallison: rev 7cc610e1b3f164fe9de00b1a35e60fd00a69bb46)

        • tika-parsers/src/test/resources/test-documents/testExcel_embeddedPDF.xls
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
        • CHANGES.txt
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • tika-parsers/src/test/resources/test-documents/testPPT_EmbeddedPDF.pptx
        • tika-parsers/src/test/resources/test-documents/testExcel_embeddedPDF.xlsx
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java
        • tika-parsers/src/test/resources/test-documents/testPPT_EmbeddedPDF.ppt
        • tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tika-trunk #1075 (See https://builds.apache.org/job/Tika-trunk/1075/)
        TIKA-2026 --fix caps on test files (tallison: rev 52f04bea6075003540a9d8b57768f814f908c442)

        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-2.x #116 (See https://builds.apache.org/job/tika-2.x/116/)
        TIKA-2026 – improve extraction of attachments for PPT, PPTX, XLSX (tallison: rev dd3c2a486a41903d5ebeb4bf341be29e02af8499)

        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
        • tika-test-resources/src/test/resources/test-documents/testPPT_embeddedPDF.pptx
        • tika-test-resources/src/test/resources/test-documents/testEXCEL_embeddedPDF.xlsx
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
        • tika-test-resources/src/test/resources/test-documents/testEXCEL_embeddedPDF.xls
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
        • tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java
        • tika-test-resources/src/test/resources/test-documents/testPPT_embeddedPDF.ppt
        • CHANGES.txt

          People

          • Assignee: Tim Allison (tallison@mitre.org)
          • Reporter: Tim Allison (tallison@mitre.org)
          • Votes: 0
          • Watchers: 2
