  1. Tika
  2. TIKA-2069

Extract Macro text from Microsoft Office documents

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: detector, parser
    • Labels:
    • Environment:

      RHEL 5.x, Apache Tomcat

      Description

      Tika supports macro-enabled Microsoft Office documents by extracting their metadata and contents; however, the macros within a document do not appear in either the metadata or the content output.
      The desire is to have the macro text extracted as well.

      Info regarding macro extraction: http://www.decalage.info/vba_tools

      1. excel-macro.PNG
        8 kB
        Jeff Swindle
      2. test-macro-doc.docm
        15 kB
        Jeff Swindle
      3. test-macro-doc.docm-tika-app-output.txt
        3 kB
        Jeff Swindle
      4. tika-app-1.14-20160928.190000-109-test-macro-doc.docm.output
        3 kB
        Jeff Swindle
      5. tika-app-1.14-20160928.190000-109-xlsmacro.xlsm.output
        3 kB
        Jeff Swindle
      6. word-macro.PNG
        24 kB
        Jeff Swindle
      7. xlsmacro.xlsm
        17 kB
        Jeff Swindle
      8. xlsmacro.xlsm.tika-app-output.txt
        2 kB
        Jeff Swindle

        Issue Links

          Activity

          tallison@mitre.org Tim Allison added a comment -

          Jeff Swindle, thank you for opening this. Would you be able to share some example test documents and expected output? Bonus points for a unit test or two...

          jeffswindle Jeff Swindle added a comment -

          Word file containing macros. Output from tika-app-1.13. Screen shot of macro within Word file.
          Excel file containing macros. Output from tika-app-1.13. Screen shot of macros within Excel file.

          jeffswindle Jeff Swindle added a comment -

          Desire is for TIKA to extract macro text from Microsoft Office files as it does metadata and content.
          Need is to search for specific signatures that may be present in macros and if present should be removed prior to distributing document. TIKA would facilitate the search.
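
          As an illustration of the use case described above, here is a minimal sketch of the signature search that would run over macro text once Tika has extracted it. The class name, method name and example signatures are hypothetical and are not part of Tika or POI; matching is a plain case-insensitive substring test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of Tika): reports which signature strings occur in macro text. */
final class MacroSignatureScanner {

    private MacroSignatureScanner() {
    }

    /**
     * Returns the signatures found in macroText, in the order they were supplied.
     * Matching is a case-insensitive substring test, since VBA is case-insensitive.
     */
    static List<String> findSignatures(String macroText, List<String> signatures) {
        String haystack = macroText.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String signature : signatures) {
            // Skip empty signatures: they would match every document.
            if (!signature.isEmpty() && haystack.contains(signature.toLowerCase(Locale.ROOT))) {
                hits.add(signature);
            }
        }
        return hits;
    }
}
```

          A document whose extracted macros produce a non-empty result would then be flagged for cleanup before distribution.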

          tallison@mitre.org Tim Allison added a comment - - edited

          Thank you!

          This question is for Jeff Swindle and fellow Tika devs (esp. Ray Gauss II and Nick Burch), should we:
          1) add macro text as metadata items (e.g. msoffice:macro)
          2) inline them in the content via <div> elements?
          3) treat them as embedded documents (mime type would be?)

          I'd prefer option 1 or 3. Option 1 is probably simpler for end users; but option 3 would allow us to capture metadata about the macro.

          Jeff Swindle, the title of this issue is for msoffice...is it ok to limit this to ooxml? Do you need this for the older doc and xls? Already handled by POI at no extra cost.

          jeffswindle Jeff Swindle added a comment -

          OOXML would be great.
          Not just limited to Word and Excel. Need Powerpoint also.

          tallison@mitre.org Tim Allison added a comment -

          Thanks to Barry Lagerweij, Nick Burch and Javen O'Neal among others, it looks like this is all nicely handled by POI now as of bug-52949.

          tallison@mitre.org Tim Allison added a comment -

          Once we upgrade to POI 3.15-beta3, this should be fairly straightforward, thanks to the work of others on POI. We may want to copy/modify the "find the vba.bin file" at the Tika level for OOXML files to pass an npoifs into VBAMacroReader from an open OOXML/zip file.
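
          A rough sketch of the "find the vba.bin file" step mentioned above, using only java.util.zip. It assumes the macro part can be recognized by an entry name ending in vbaProject.bin (for example word/vbaProject.bin); a stricter implementation would resolve the part through the package relationships instead. The class and method names are hypothetical. Wrapping the returned bytes in a POIFS file system and passing that to VBAMacroReader is left out because it needs POI on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: pulls the raw VBA project part out of an OOXML (zip) stream. */
final class VbaProjectLocator {

    private VbaProjectLocator() {
    }

    /** Returns the bytes of the first entry named *vbaProject.bin, or null if there is none. */
    static byte[] readVbaProject(InputStream ooxml) {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (!entry.isDirectory() && name.endsWith("vbaproject.bin")) {
                    // Copy the current entry; read() returns -1 at the end of the entry.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    return out.toByteArray();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```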

          gagravarr Nick Burch added a comment -

          I think that, given both how big macros can get and how they logically fit with the document, as an embedded document might be best

          Mimetype wise, some people seem to use application/x-vba, but the office content types file uses application/vnd.ms-office.vbaProject. Our own tika mimetypes file defines text/x-vbasic. I'd lean towards one of the latter two

          tallison@mitre.org Tim Allison added a comment -

          Sounds good. Thank you, Nick Burch.

          Do we want to distinguish between an attached vba/text file and a macro? Perhaps add MACRO to TikaCoreProperties.EmbeddedResourceType? Or, do we want to distinguish between the two by using a different mime type? I think I'd prefer the former.

          gagravarr Nick Burch added a comment -

          I think the idea of a Macro is probably general enough across a range of file formats that we could add it as an embedded type

          However, there's actually 2 levels to an OOXML macro. The OOXML file contains a binary vba project bin file, and within that is the actual macro text + its properties. Maybe we should have the ooxml extractor first expose a `application/vnd.ms-office.vbaProject` embedded resource, then we use a second parser which extracts a body of the macro vbscript as text/x-vbasic with the other macro properties/attributes (name, sid, various boolean flags) as metadata?

          eg application/vnd.ms-excel.sheet.macroenabled.12 -> application/vnd.ms-office.vbaProject -> text/x-vbasic + metadata
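
          To make the "macro text + metadata" split concrete, here is a small sketch that separates the leading Attribute VB_* lines of a module from the rest of its source. It assumes those header lines are the properties one would want to surface as metadata, which the thread does not confirm; the class and field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class: module-level attributes plus the remaining macro source. */
final class VbaModuleHeader {

    // Matches module-level lines such as: Attribute VB_Name = "Module1"
    private static final Pattern ATTRIBUTE =
            Pattern.compile("Attribute\\s+(VB_\\w+)\\s*=\\s*(.*?)\\s*");

    final Map<String, String> attributes;
    final String body;

    private VbaModuleHeader(Map<String, String> attributes, String body) {
        this.attributes = attributes;
        this.body = body;
    }

    /** Consumes leading Attribute lines; everything after the first other line is the body. */
    static VbaModuleHeader parse(String moduleSource) {
        Map<String, String> attributes = new LinkedHashMap<>();
        String[] lines = moduleSource.split("\\r?\\n", -1);
        int i = 0;
        for (; i < lines.length; i++) {
            Matcher m = ATTRIBUTE.matcher(lines[i]);
            if (!m.matches()) {
                break;
            }
            attributes.put(m.group(1), stripQuotes(m.group(2)));
        }
        // Re-join the remaining lines with \n (line endings are normalized).
        StringBuilder body = new StringBuilder();
        for (int j = i; j < lines.length; j++) {
            body.append(lines[j]);
            if (j < lines.length - 1) {
                body.append('\n');
            }
        }
        return new VbaModuleHeader(attributes, body.toString());
    }

    private static String stripQuotes(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```

          The attributes map could then be copied into the metadata of the embedded text/x-vbasic resource, with body as its content.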

          tallison@mitre.org Tim Allison added a comment -

          Makes sense, although I'd prefer to write one parser rather than two. Would the application/vnd.ms-office.vbaProject ever have any content? Would its metadata be different from the vbscript?

          gagravarr Nick Burch added a comment -

          Yes! If you wrote a VB Script, and zipped it up, it'd be a text/x-vbasic with no extra metadata. When you add a macro to an office doc, you get the macro text but also some metadata. We wouldn't need a parser for text/x-vbasic, only for application/vnd.ms-office.vbaProject, which would expose the embedded script text + metadata

          tallison@mitre.org Tim Allison added a comment -

          Just realized that we might want to handle extraction of Actions and/or javascript from PDFs in a similar way? New+related ticket if anyone has an interest?

          tallison@mitre.org Tim Allison added a comment -

          I think I get it. One challenge is that we currently get a Map<String, String> from POI, so there doesn't seem to be an obvious way to link metadata to the actual macro text. On POI's test doc, with this code:

                  VBAMacroReader reader = new VBAMacroReader(fs);
                  // Look up the embedded document extractor once, falling back to the default.
                  EmbeddedDocumentExtractor ex = context.get(EmbeddedDocumentExtractor.class);
                  if (ex == null) {
                      ex = new ParsingEmbeddedDocumentExtractor(context);
                  }
                  // readMacros() maps each module name to that module's source text.
                  for (Map.Entry<String, String> e : reader.readMacros().entrySet()) {
                      Metadata m = new Metadata();
                      m.set(Metadata.EMBEDDED_RESOURCE_TYPE, TikaCoreProperties.EmbeddedResourceType.MACRO.toString());
                      m.set(Metadata.CONTENT_TYPE, "text/x-vbasic");
                      if (ex.shouldParseEmbedded(m)) {
                          // Hand the macro source to the extractor as an embedded document.
                          ex.parseEmbedded(new ByteArrayInputStream(e.getValue().getBytes(StandardCharsets.UTF_8)), xhtml, m, true);
                      }
                  }

          we get:

          1: X-Parsed-By : org.apache.tika.parser.DefaultParser
          1: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
          1: embeddedResourceType : MACRO
          1: Content-Encoding : windows-1252
          1: X-TIKA:parse_time_millis : 27
          1: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
          <meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
          <meta name="embeddedResourceType" content="MACRO" />
          <meta name="Content-Encoding" content="windows-1252" />
          <meta name="X-TIKA:embedded_resource_path" content="/embedded-1" />
          <meta name="Content-Type" content="text/plain; charset=windows-1252" />
          <title></title>
          </head>
          <body><p>Attribute VB_Name = "Module1"
          Sub TestMacro()
          '
          ' TestMacro Macro
          ' This is a test macro
          '
          
          '
              ActiveDocument.Paragraphs(1).Range.Text = "This is a macro word processing document"
          End Sub
          
          </p>
          </body></html>
          1: X-TIKA:embedded_resource_path : /embedded-1
          1: Content-Type : text/plain; charset=windows-1252
          2: X-Parsed-By : org.apache.tika.parser.DefaultParser
          2: X-Parsed-By : org.apache.tika.parser.txt.TXTParser
          2: embeddedResourceType : MACRO
          2: Content-Encoding : windows-1252
          2: X-TIKA:parse_time_millis : 4
          2: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
          <meta name="X-Parsed-By" content="org.apache.tika.parser.txt.TXTParser" />
          <meta name="embeddedResourceType" content="MACRO" />
          <meta name="Content-Encoding" content="windows-1252" />
          <meta name="X-TIKA:embedded_resource_path" content="/embedded-2" />
          <meta name="Content-Type" content="text/plain; charset=windows-1252" />
          <title></title>
          </head>
          <body><p>Attribute VB_Name = "ThisDocument"
          Attribute VB_Base = "1Normal.ThisDocument"
          Attribute VB_GlobalNameSpace = False
          Attribute VB_Creatable = False
          Attribute VB_PredeclaredId = True
          Attribute VB_Exposed = True
          Attribute VB_TemplateDerived = True
          Attribute VB_Customizable = True
          </p>
          </body></html>
          2: X-TIKA:embedded_resource_path : /embedded-2
          2: Content-Type : text/plain; charset=windows-1252
          

          Is this good enough for now?

          tallison@mitre.org Tim Allison added a comment -

          This reminds me that I need to commit TIKA-2047 so that the mime-type isn't overwritten.

          jeffswindle Jeff Swindle added a comment -

          For my purposes, the output shown is good. I need the macro text content
          primarily.

          Thanks Tim!

          tallison@mitre.org Tim Allison added a comment -

          Jeff Swindle, I should point out that VBAMacroReader is still relatively new in POI, and there are currently 3 open bugs (60158, 59830, 59858), one of them triggered by the docm file that you submitted.

          For now, we'll swallow the exceptions in Tika, but there's much more work to be done. Patches to POI would be welcomed!

          tallison@mitre.org Tim Allison added a comment -

          Currently, POI returns all of the macros in a module appended together as one string, for example:

          <body><p>Attribute VB_Name = "NewMacros"
          Sub Embolden()
          Attribute Embolden.VB_Description = "This tests changing the selection to bold"
          Attribute Embolden.VB_ProcData.VB_Invoke_Func = "Project.NewMacros.Embolden"
          '
          ' Embolden Macro
          '
          '
              Selection.Font.Bold = wdToggle
              Selection.Font.BoldBi = wdToggle
          End Sub
          
          Sub Italicize()
          Attribute Italicize.VB_Description = "This tests italicizing"
          Attribute Italicize.VB_ProcData.VB_Invoke_Func = "Project.NewMacros.Italicize"
          '
          ' Italicize Macro
          '
          '
              Selection.Font.Italic = wdToggle
              Selection.Font.ItalicBi = wdToggle
          End Sub
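
          If individual macros are needed rather than one string per module, the text can be split after extraction. The sketch below is one way to do that, assuming every procedure starts with a Sub or Function line and ends with a matching End Sub or End Function line; Property procedures and single-line definitions are ignored. The class name is hypothetical and this is not something POI or Tika provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits one VBA module's source into its Sub/Function procedures. */
final class VbaProcedureSplitter {

    private static final Pattern START = Pattern.compile(
            "\\s*(?:(?:Public|Private|Friend)\\s+)?(?:Static\\s+)?(?:Sub|Function)\\s+(\\w+).*",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern END = Pattern.compile(
            "\\s*End\\s+(?:Sub|Function)\\b.*", Pattern.CASE_INSENSITIVE);

    private VbaProcedureSplitter() {
    }

    /** Returns procedure name -> procedure source, in order of appearance. */
    static Map<String, String> split(String moduleSource) {
        Map<String, String> procedures = new LinkedHashMap<>();
        String currentName = null;
        StringBuilder current = new StringBuilder();
        for (String line : moduleSource.split("\\r?\\n")) {
            if (currentName == null) {
                Matcher start = START.matcher(line);
                if (!start.matches()) {
                    continue; // module-level line, e.g. Attribute VB_Name
                }
                currentName = start.group(1);
            }
            current.append(line).append('\n');
            if (END.matcher(line).matches()) {
                procedures.put(currentName, current.toString());
                currentName = null;
                current.setLength(0);
            }
        }
        return procedures;
    }
}
```

          Applied to the module above, this would yield two entries, Embolden and Italicize, each holding only its own lines.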
          
          tallison@mitre.org Tim Allison added a comment -

          I think there may be a bit more work to do at the POI level. There are still a few open issues in POI for NPE, AIOOBE, etc. Tika is currently swallowing these...I plan to do a run against our regression corpus with the swallowing turned off to help us prioritize known and identify new bugs in macro extraction at the POI level.
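
          The swallow-or-fail behaviour described above could look roughly like the following sketch. It is not the actual Tika code: the class, the swallowExceptions flag and the problems list are assumptions made for illustration, and the Callable stands in for the call into POI's macro reader.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical wrapper: reads macros, either swallowing or surfacing extraction failures. */
final class LenientMacroReader {

    private LenientMacroReader() {
    }

    /**
     * Runs macroSource. With swallowExceptions set, any failure is recorded in problems and
     * an empty map is returned; otherwise the failure is rethrown wrapped in an
     * IllegalStateException, so that a regression run can count and triage it.
     */
    static Map<String, String> read(Callable<Map<String, String>> macroSource,
                                    boolean swallowExceptions,
                                    List<String> problems) {
        try {
            return macroSource.call();
        } catch (Exception e) {
            // Covers unchecked failures such as NPE and AIOOBE as well as checked ones.
            if (!swallowExceptions) {
                throw new IllegalStateException("macro extraction failed", e);
            }
            problems.add(e.toString());
            return Collections.emptyMap();
        }
    }
}
```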

          I also found that POI wasn't extracting macros from the 'ppt' file I created as a test (see poi 60162).

          Patches are welcomed!

          Let's close this ticket and open another to track the improvements in POI.

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #50 (See https://builds.apache.org/job/tika-2.x-windows/50/)
          TIKA-2069 – extract macros from MSOffice files. (tallison: rev 66f433471f59d5af931f0a49bf8bddd33a7f27a7)

          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.doc
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xlsm
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.pptm
          • (edit) CHANGES.txt
          • (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xls
          • (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.ppt
          • (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.docm
          hudson Hudson added a comment -

          ABORTED: Integrated in Jenkins build tika-2.x #146 (See https://builds.apache.org/job/tika-2.x/146/)
          TIKA-2069 – extract macros from MSOffice files. (tallison: rev 66f433471f59d5af931f0a49bf8bddd33a7f27a7)

          • (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.docm
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xls
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_macros.doc
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.pptm
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • (edit) CHANGES.txt
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (add) tika-test-resources/src/test/resources/test-documents/testEXCEL_macro.xlsm
          • (add) tika-test-resources/src/test/resources/test-documents/testPPT_macros.ppt
          hudson Hudson added a comment -

          ABORTED: Integrated in Jenkins build Tika-trunk #1104 (See https://builds.apache.org/job/Tika-trunk/1104/)
          TIKA-2069 – extract macros from MSOffice docs (tallison: rev 2ae7206d9c99fb553314cff21bb155d4e6f06d12)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
          • (edit) CHANGES.txt
          • (edit) tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_macros.docm
          • (add) tika-parsers/src/test/resources/test-documents/testPPT_macros.pptm
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_macros.doc
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testEXCEL_macro.xlsm
          • (add) tika-parsers/src/test/resources/test-documents/testEXCEL_macro.xls
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java
          • (add) tika-parsers/src/test/resources/test-documents/testPPT_macros.ppt
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #51 (See https://builds.apache.org/job/tika-2.x-windows/51/)
          TIKA-2069 – extract macros from MSOffice docs, fix tests to find target (tallison: rev d543378a88aeca574d15ab31d13b6316fb938f7f)

          • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #147 (See https://builds.apache.org/job/tika-2.x/147/)
          TIKA-2069 – extract macros from MSOffice docs, fix tests to find target (tallison: rev d543378a88aeca574d15ab31d13b6316fb938f7f)

          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1105 (See https://builds.apache.org/job/Tika-trunk/1105/)
          TIKA-2069 – extract macros from MSOffice docs, fix tests to find target (tallison: rev 8a45f67a2e3641b08fcfb5e2283e4a43ff86f3cd)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ExcelParserTest.java
          • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
          jeffswindle Jeff Swindle added a comment -

          Output of tika-app against test files.
          The xlsmacro.xlsm run outputs the macro contents.
          The test-macro-doc.docm run doesn't output the macro contents.

          tallison@mitre.org Tim Allison added a comment -

          Right. Sorry. Unfortunately, there's a bug in POI that prevents reading the macro in your docm file. See above.

          There's still some work to do on the POI side.

          jeffswindle Jeff Swindle added a comment -

          Tim Allison I tried a tika-app 1.14 snapshot and didn't get the expected output for the test-macro-doc.docm file. I also tried another internal file and didn't see macro output.
          Executing against xlsmacro.xlsm provided expected output of macro content.

          I've attached the output from tika-app against xlsmacro.xlsm and test-macro-doc.docm.
          Here are the commands I used:

          1. java -jar tika-app-1.14-20160928.190000-109.jar test-macro-doc.docm > tika-app-1.14-20160928.190000-109-test-macro-doc.docm.output
          2. java -jar tika-app-1.14-20160928.190000-109.jar xlsmacro.xlsm > tika-app-1.14-20160928.190000-109-xlsmacro.xlsm.output

          Is there something specific I need to add to the command to extract the macro in the docm?
          jeffswindle Jeff Swindle added a comment -

          Thanks.

          tallison@mitre.org Tim Allison added a comment -

          Yes, sorry. I opened TIKA-2104 to track this.


            People

            • Assignee:
              Unassigned
              Reporter:
              jeffswindle Jeff Swindle
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved:

                Development