TIKA-2179: WordMLParser fails to parse a Word XML file

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels:
      None
    • Environment:

      OSX, java 8

      Description

      Problem

      I have a sample Word XML file (attached as File5.xml) that can be parsed by neither OOXMLParser nor OfficeParser. OOXMLParser yields an exception that was Caused by: org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: The supplied data appears to be a raw XML file. Formats such as Office 2003 XML are not supported. OfficeParser yields an exception like: org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data appears to be a raw XML file. Formats such as Office 2003 XML are not supported.

      I found TIKA-1958, which mentioned the new WordMLParser, so I downloaded the source, built it, and updated my Tika version to 1.14. However, when parsing with WordMLParser, the output text content I get is the empty string "", but I'm expecting something more like:

      It means that the guy that you are trading with was reported for a scam attempt. As the others mentioned, some of these BOFA could be false.
      What's important is the current trade that you are doing.
      If everything seems to be in order then there is nothing wrong with going through with the trade.
      Auti, Sneha (QAPM)
      

      Replication

      You can replicate with the below Spock test

          def "display error with WordMLParser"(){
              setup:
              File input = new File("/Users/sstory/Downloads/File5.xml") //modify for your path
              Parser parser = new WordMLParser()
              //Parser parser = new OOXMLParser()
              //Parser parser = new OfficeParser()
              org.xml.sax.ContentHandler textHandler = new BodyContentHandler(-1)
              Metadata metadata = new Metadata()
              ParseContext context = new ParseContext()
              
              when:
              parser.parse(input.newInputStream(), textHandler, metadata, context)
              String result = textHandler.toString()
      
              then:
              !result.isEmpty()
              result.contains("the guy that you are trading with")
              result.contains("BOFA")
          }
      
      1. File5.xml
        48 kB
        Sean Story

          Activity

          seanstory Sean Story added a comment -

          Using XMLParser provides a reasonable workaround, but the output ends up looking like:

                                        It means that the guy that you are trading with was reported for a scam attempt. As the others mentioned, some of these BO     FA      could be false.           What's important is the current trade that you are doing.           If everything seems to be in order then there is nothing wrong with going through with the trade.                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Auti, Sneha (QAPM) Auti, Sneha (QAPM) 2 2016-09-14T06:16:00Z 2016-09-14T06:23:00Z                                                                                                                                                                                                              Normal.dotm 7 1 44 257 Microsoft Office Word 0 2 1 false Morgan Stanley false 300 false false 14.0000
          

          which is sub-optimal, since it has added whitespace characters all over the content
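A minimal sketch of post-processing the XMLParser workaround output, assuming it is acceptable to collapse every whitespace run into a single space. The class and method names are hypothetical and not part of Tika. Note that this cannot rejoin a word that was split across runs: "BO     FA" becomes "BO FA", not "BOFA".

```java
final class WhitespaceCollapser {

    // Collapse each run of whitespace (spaces, tabs, newlines) into one space
    // and strip leading/trailing whitespace.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "      some of these BO     FA      could be false.           "
                + "What's important is the current trade that you are doing.   ";
        System.out.println(collapse(raw));
    }
}
```

This only tidies the text for display or indexing; the metadata values that XMLParser appends after the body text would still be mixed into the result.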

          tallison@mitre.org Tim Allison added a comment -

          Thank you for opening this and sharing a triggering file. This looks like an intermediary between 2003 and ooxml. I'll take a look.

          tallison@mitre.org Tim Allison added a comment -

          Yes, this is a 2006 ML formatted document. Following the example on that link, I can generate this format by saving as "Word XML Document".

          The format appears to be an "inlined" ooxml OPCPackage where the parts are not separate streams in a zip, but are identified by <pkg:part> elements.

          I suspect the "right" solution would be to create a new subclass of OPCPackage in POI that handles this inline format. I wonder if it would be more expedient, though, to create a temporary zip file, rewrite this as a more modern .docx and then use the usual ZipPackage and XWPFWordExtractorDecorator.
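To illustrate the "inlined" package layout described above, here is a rough sketch that lists the <pkg:part> elements of such a file with the JDK's StAX reader. It assumes the pkg prefix is bound to http://schemas.microsoft.com/office/2006/xmlPackage and that each part carries pkg:name and pkg:contentType attributes. The class name and the tiny sample document are made up for illustration; this is not the code that was committed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class FlatOpcPartLister {

    static final String PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage";

    // Hypothetical two-part sample; a real file holds the part bodies inside pkg:xmlData.
    static final String SAMPLE =
            "<pkg:package xmlns:pkg='" + PKG_NS + "'>"
            + "<pkg:part pkg:name='/_rels/.rels'"
            + " pkg:contentType='application/vnd.openxmlformats-package.relationships+xml'>"
            + "<pkg:xmlData/></pkg:part>"
            + "<pkg:part pkg:name='/word/document.xml'"
            + " pkg:contentType='application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml'>"
            + "<pkg:xmlData/></pkg:part>"
            + "</pkg:package>";

    // Returns one "name -> contentType" entry per pkg:part, in document order.
    static List<String> listParts(String xml) {
        List<String> parts = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs needed here
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && PKG_NS.equals(reader.getNamespaceURI())
                        && "part".equals(reader.getLocalName())) {
                    parts.add(reader.getAttributeValue(PKG_NS, "name")
                            + " -> " + reader.getAttributeValue(PKG_NS, "contentType"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : listParts(SAMPLE)) {
            System.out.println(part);
        }
    }
}
```

Each listed part corresponds to what would be a separate zip entry in a regular .docx, which is why rewriting the file into a temporary zip is one conceivable route.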

          tallison@mitre.org Tim Allison added a comment -

          How's this look:

          0: cp:revision : 2
          0: date : 2016-09-14T06:23:00Z
          0: extended-properties:DocSecurity : 0
          0: extended-properties:AppVersion : 14.0000
          0: meta:word-count : 44
          0: meta:paragraph-count : 1
          0: dc:creator : Auti, Sneha (QAPM)
          0: extended-properties:Company : Morgan Stanley
          0: dcterms:created : 2016-09-14T06:16:00Z
          0: meta:line-count : 2
          0: dcterms:modified : 2016-09-14T06:23:00Z
          0: Last-Modified : 2016-09-14T06:23:00Z
          0: Last-Save-Date : 2016-09-14T06:23:00Z
          0: meta:character-count : 257
          0: meta:save-date : 2016-09-14T06:23:00Z
          0: meta:character-count-with-spaces : 300
          0: extended-properties:TotalTime : 7
          0: modified : 2016-09-14T06:23:00Z
          0: Content-Type : application/vnd.ms-word2006ml
          0: X-Parsed-By : org.apache.tika.parser.DefaultParser
          0: X-Parsed-By : org.apache.tika.parser.microsoft.ooxml.xwpf.Word2006MLParser
          0: creator : Auti, Sneha (QAPM)
          0: meta:author : Auti, Sneha (QAPM)
          0: meta:creation-date : 2016-09-14T06:16:00Z
          0: extended-properties:Application : Microsoft Office Word
          0: Creation-Date : 2016-09-14T06:16:00Z
          0: cp:lastModifiedBy : Auti, Sneha (QAPM)
          0: extended-properties:Template : Normal.dotm
          0: X-TIKA:parse_time_millis : 194
          0: Author : Auti, Sneha (QAPM)
          0: X-TIKA:content : <html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
          <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.xwpf.Word2006MLParser" />
          <meta name="Content-Type" content="application/vnd.ms-word2006ml" />
          <title></title>
          </head>
          <body><p>It means that the guy that you are trading with was reported for a scam attempt. As the others mentioned, some of these BOFA could be false. </p>
          <p>What's important is the current trade that you are doing. </p>
          <p>If everything seems to be in order then there is nothing wrong with going through with the trade. </p>
          <p />
          </body></html>
          0: meta:page-count : 1
          
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #77 (See https://builds.apache.org/job/tika-2.x-windows/77/)
          Add mime detection and parser for Word 2006ML format (TIKA-2179). (tallison: rev 2f452304b9628e28caf89e714f83e01fce481a30)

          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/CorePropertiesHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Relationship.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2006ml_src.docx
          • (edit) CHANGES.txt
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ExtendedPropertiesHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BinaryDataHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/PartHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/MSOfficeParserConfig.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParser.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2006ml.xml
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLHandler.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2003ml.xml
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParserTest.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BodyContentHandler.java
          • (edit) tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsManager.java
          tallison@mitre.org Tim Allison added a comment - - edited

          I committed a reasonable first pass at this.

          Still left on the list for further work on other tickets (so that I don't forget):

          1. Convert to a double pass...read the extra stuff first, then parse the main document.xml. The list info comes after the document, and on a single pass, that and several other things fall by the wayside. Hyperlinks happen to work, but that's only because those rels happen to come before document.xml in the test doc.
          b. Add macro extraction from the ole.bin
          iii. Make inline image markup consistent with xwpf
          4. Figure out how to handle the chart data
          E) include proper div markings for non main document content, footers, headers, etc.
          VI - We are skipping "alternateContent" Fallback in favor of Choice. At least with the chart in the test file, this is not the right choice. Which should we pick?
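The double-pass idea from the first item can be sketched roughly as follows, assuming the whole file can be held in memory and parsed twice with the JDK's SAX parser. Pass one collects relationship ids and targets; pass two resolves the r:id on each w:hyperlink, so the order of the parts no longer matters. The class name and sample are hypothetical, and the sketch deliberately ignores which part a relationship belongs to, which a real implementation would need to track.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TwoPassHyperlinks {

    static final String PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage";
    static final String REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample in which the rels part comes AFTER document.xml.
    static final String SAMPLE =
            "<pkg:package xmlns:pkg='" + PKG_NS + "'>"
            + "<pkg:part pkg:name='/word/document.xml'><pkg:xmlData>"
            + "<w:document xmlns:w='" + W_NS + "' xmlns:r='" + R_NS + "'><w:body><w:p>"
            + "<w:hyperlink r:id='rId5'><w:r><w:t>Tika</w:t></w:r></w:hyperlink>"
            + "</w:p></w:body></w:document></pkg:xmlData></pkg:part>"
            + "<pkg:part pkg:name='/word/_rels/document.xml.rels'><pkg:xmlData>"
            + "<Relationships xmlns='" + REL_NS + "'>"
            + "<Relationship Id='rId5' Target='https://tika.apache.org/' TargetMode='External'/>"
            + "</Relationships></pkg:xmlData></pkg:part>"
            + "</pkg:package>";

    static List<String> resolveHyperlinks(String xml) {
        // Pass 1: gather every relationship id -> target, wherever it appears.
        final Map<String, String> targets = new HashMap<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (REL_NS.equals(uri) && "Relationship".equals(local)) {
                    targets.put(atts.getValue("", "Id"), atts.getValue("", "Target"));
                }
            }
        });
        // Pass 2: walk the document and resolve each hyperlink's r:id.
        final List<String> links = new ArrayList<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "hyperlink".equals(local)) {
                    String id = atts.getValue(R_NS, "id");
                    if (id != null) {
                        links.add(targets.getOrDefault(id, "unresolved:" + id));
                    }
                }
            }
        });
        return links;
    }

    private static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveHyperlinks(SAMPLE));
    }
}
```

The trade-off is that the input has to be buffered or spooled to disk so it can be read twice, which gives up some of the benefit of pure streaming.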

          What this has that our current docx extractor doesn't at the moment:
          1) no beans, purely read only <wild_speculation>should have better memory footprint</wild_speculation> (see also TIKA-1321)
          2) ability to choose whether or not to extract deleted text (TIKA-2036)
          3) ability to handle glossary document content (TIKA-2163)
          4) <wild_speculation>I think this should be immune to the rare unicode bugs that we've seen with DOM...I need to test this (see TIKA-1961)</wild_speculation>
          5) <wild_speculation>we're not likely to miss content because we're grabbing <w:t> wherever they are (TIKA-1317 and friends). </wild_speculation>
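As a rough illustration of points 2 and 5, the sketch below collects the text of every <w:t> element wherever it occurs and optionally includes <w:delText> (deleted text). It uses the JDK's SAX parser and assumes the standard WordprocessingML namespace. The class name, the includeDeleted flag, and the sample are illustrative only and do not reflect the committed handler's API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class RunTextCollector extends DefaultHandler {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical sample: one word split across two runs, plus a tracked deletion.
    static final String SAMPLE =
            "<w:document xmlns:w='" + W_NS + "'><w:body><w:p>"
            + "<w:r><w:t>BO</w:t></w:r><w:r><w:t>FA</w:t></w:r>"
            + "<w:del><w:r><w:delText> old</w:delText></w:r></w:del>"
            + "</w:p></w:body></w:document>";

    private final boolean includeDeleted;
    private final StringBuilder text = new StringBuilder();
    private boolean inText;

    RunTextCollector(boolean includeDeleted) {
        this.includeDeleted = includeDeleted;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (W_NS.equals(uri)
                && ("t".equals(local) || (includeDeleted && "delText".equals(local)))) {
            inText = true; // start capturing character data
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (!W_NS.equals(uri)) {
            return;
        }
        if ("t".equals(local) || "delText".equals(local)) {
            inText = false;
        } else if ("p".equals(local)) {
            text.append('\n'); // one line per paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            text.append(ch, start, length);
        }
    }

    static String extract(String xml, boolean includeDeleted) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            RunTextCollector handler = new RunTextCollector(includeDeleted);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(extract(SAMPLE, false));
        System.out.print(extract(SAMPLE, true));
    }
}
```

Because adjacent runs are appended without separators, a word split across runs comes out intact, unlike the generic XMLParser output quoted earlier in the thread.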

          On the down side...this re-invents several helper classes from POI and Tika, which I really, really regret.

          Open questions:
          1. Nick Burch and fellow devs, how does this commit look? Anything crazy that ought to be fixed, including the mime-type?
          2. Is there any way to move most of this into POI? The current OPCPackage and the rest of the code appear to be tightly tied to ZipPackage and beans. I could add this stuff as a standalone streaming/readonly xwpf set of objects, but do we want that in POI?
          3. What do you think of converting our current docx processing to these classes? I don't think it would take much to rework a bit to pull the related bits from the zip and then process the document.xml as we're currently doing.

          hudson Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build tika-2.x #176 (See https://builds.apache.org/job/tika-2.x/176/)
          Add mime detection and parser for Word 2006ML format (TIKA-2179). (tallison: rev 2f452304b9628e28caf89e714f83e01fce481a30)

          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/PartHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsManager.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/MSOfficeParserConfig.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BodyContentHandler.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2006ml_src.docx
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BinaryDataHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/CorePropertiesHandler.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2003ml.xml
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_2006ml.xml
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Relationship.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParser.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (edit) CHANGES.txt
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ExtendedPropertiesHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLHandler.java
          • (add) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParserTest.java
          • (edit) tika-parser-bundles/tika-parser-office-bundle/src/test/java/org/apache/tika/module/office/BundleIT.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1144 (See https://builds.apache.org/job/Tika-trunk/1144/)
          TIKA-2179 – add detection and parsing for word2006ml files (tallison: rev 81fad8c97e60a3de7d926dc4ce10cbd235549583)

          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BinaryDataHandler.java
          • (edit) CHANGES.txt
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/MSOfficeParserConfig.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/PartHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParser.java
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParserTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_2006ml.xml
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLHandler.java
          • (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_2006ml_src.docx
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Relationship.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsManager.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_2003ml.xml
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BodyContentHandler.java
          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/CorePropertiesHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ExtendedPropertiesHandler.java
          tallison@mitre.org Tim Allison added a comment -

          Thank you, Sean Story, for opening this. Let us know what else you find.

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build tika-2.x-windows #79 (See https://builds.apache.org/job/tika-2.x-windows/79/)
          TIKA-2179 – add detection and parsing for word2006ml files – this (tallison: rev 1bb7c33846203900c1ec791c7a2a958912da2a9c)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #178 (See https://builds.apache.org/job/tika-2.x/178/)
          TIKA-2179 – add detection and parsing for word2006ml files – this (tallison: rev 1bb7c33846203900c1ec791c7a2a958912da2a9c)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

            People

            • Assignee: Tim Allison (tallison@mitre.org)
            • Reporter: Sean Story (seanstory)
            • Votes: 0
            • Watchers: 3