Tika / TIKA-2191

Apply current .docx unit tests to experimental SAX parser and fix or document as necessary

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels: None

      Description

      There are many areas for clean up to ensure that the new SAX .docx parser yields similar results to the legacy DOM .docx parser. Let's use this issue to track work on improvements.
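One way to picture the parity goal is a check that runs the same WordprocessingML through a DOM reader and a SAX reader and compares the extracted text. The following is a minimal sketch using only the JDK's XML APIs; the class `DomSaxParity` and its methods are hypothetical names for illustration and are not part of Tika, and nested paragraphs (tables, text boxes) are deliberately ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: extracts paragraph text two ways.
class DomSaxParity {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    /** DOM approach: load the whole tree, then walk w:p and w:t elements. */
    static String textViaDom(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(stream(xml));
        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList texts = ((Element) paragraphs.item(i))
                    .getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n'); // one line per paragraph
        }
        return out.toString();
    }

    /** SAX approach: stream events and keep only a small amount of state. */
    static String textViaSax(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(stream(xml), new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    out.append('\n'); // one line per paragraph
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Bye</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.println(textViaDom(xml).equals(textViaSax(xml)));
    }
}
```

A real parity test would run both parsers over actual .docx files; this sketch only shows the shape of the comparison.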

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Finally got around to updating 2.x

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1158 (See https://builds.apache.org/job/Tika-trunk/1158/)
        TIKA-2191 – optimize branching in start and endElement based on corpus (tallison: rev 653b980f51feff361381f440bef087e20c69784f)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
        tallison@mitre.org Tim Allison added a comment -

        I counted the elements in the main story .xml file (mostly document.xml) in ~150k doc[xm] files in our regression corpus. I optimized the if/else branching in startElement and endElement to test for the most common elements earlier.

        There are a few other interesting things in these stats, including rare "dev" namespaces like http://schemas.openxmlformats.org/wordprocessingml/2006/2/main
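The counting step described in this comment can be approximated with a small SAX handler that tallies start tags by local name and then ranks them. This is a minimal sketch under assumptions: `ElementFrequency` is a hypothetical class (not the code used for the corpus run), and it processes one in-memory XML string rather than ~150k files. The resulting ranking is the kind of input one could use to decide which element names to test first in an if/else chain.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class.
class ElementFrequency {

    /** Counts start tags by local name in a single XML document. */
    static Map<String, Integer> count(String xml) throws Exception {
        final Map<String, Integer> counts = new HashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required so localName is populated
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        counts.merge(localName, 1, Integer::sum);
                    }
                });
        return counts;
    }

    /** Returns local names ordered from most to least frequent. */
    static List<String> byFrequency(final Map<String, Integer> counts) {
        List<String> names = new ArrayList<>(counts.keySet());
        names.sort((a, b) -> {
            int cmp = Integer.compare(counts.get(b), counts.get(a));
            return cmp != 0 ? cmp : a.compareTo(b); // ties broken by name
        });
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"http://schemas.openxmlformats.org/"
                + "wordprocessingml/2006/main\"><w:body>"
                + "<w:p><w:r><w:t>a</w:t></w:r><w:r><w:t>b</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>c</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        System.out.println(byFrequency(count(xml)));
    }
}
```

Keying the map on `uri + localName` instead of `localName` alone would also surface unusual namespaces in the same pass.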

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1155 (See https://builds.apache.org/job/Tika-trunk/1155/)
        TIKA-2191: convert Styles reader to SAX and store only (tallison: rev 0f78a314f52b64d84072758ea66fc0d797271f2f)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1154 (See https://builds.apache.org/job/Tika-trunk/1154/)
        TIKA-2191: fixes after regression testing on TIKA_1302 corpus: 1) add (tallison: rev 0f3fe380cb10cb0e1d47a3262287561687544035)

        • (add) tika-parsers/src/test/resources/test-documents/testWORD_template.docx
        hudson Hudson added a comment -

        UNSTABLE: Integrated in Jenkins build Tika-trunk #1153 (See https://builds.apache.org/job/Tika-trunk/1153/)
        TIKA-2191: fixes after regression testing on TIKA_1302 corpus: 1) add (tallison: rev faf6c2b24814ded27f05388f8a417c2df5bf5c7a)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
        • (add) tika-parsers/src/test/resources/test-documents/testWORD_template.dotx
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1151 (See https://builds.apache.org/job/Tika-trunk/1151/)
        TIKA-2191 - step 6 add list numbering, bookmarks and styles (tallison: rev 3ee9fd5bf3df913dc8d3cf8cf76da433bb7f9e17)

        • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFStylesShim.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BodyPartHandler.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
        • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFParagraphProperties.java
        • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFNumberingShim.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        tallison@mitre.org Tim Allison added a comment -

        I added paragraph numbering, styles and bookmarks. I think I'm going to punt on handling footnotes and comments closer to where they belong in the document. I'll document that as one of the major differences and call it a day...unless there is an urgent need for this.

        Once I apply the patches to 2.x, I'll resolve this issue and run the regression tests.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1150 (See https://builds.apache.org/job/Tika-trunk/1150/)
        TIKA-2191 – step1 – add other docx tests and comment/ignore where (tallison: rev 894301307da5167c95585688f9448d3050f53aaa)

        • (add) tika-parsers/src/test/resources/org/apache/tika/parser/microsoft/tika-config-sax-docx.xml
        • (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
        • (delete) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java

        TIKA-2191 – step2 – add handling for docm files...extract macros (tallison: rev f93d4e1fffdb4a441f7fa750a43691adfa70c392)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java

        TIKA-2191 – step 3 – clean up <b> and <i> tag handling (tallison: rev 1aca10a26dada02a045a1bc9eb7c3cfc1b36a83e)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java

        TIKA-2191 – step 4 – add markup for embedded pics (tallison: rev 806eaf8b1802a3a3071a5ae0bdc35c20d6739280)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java

        TIKA-2191 – step 5 actually extract images embedded in areas besides (tallison: rev 4469ca2c4ea725e9f5d94c116aaf248deea2a6eb)

        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
        • (add) tika-parsers/src/test/resources/test-documents/testWORD_embedded_pics.docx
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/SXWPFExtractorTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java

        update changes for TIKA-2191 and TIKA-2192 (tallison: rev 5425d02a1ed97ce5f884a076f55ad8197cc6ac7b)

        • (edit) CHANGES.txt
        tallison@mitre.org Tim Allison added a comment -

        Just pushed a number of fixes focused on hyperlinks, <b|i> tags, extracting objects embedded in headers, etc., and handling for docm files (to extract macros).

        The SAX parser still needs:
        1) application of styles
        2) application of paragraph numbering
        3) application of bookmarks
        4) placement of footnotes closer to citation/paragraph.
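To illustrate the kind of <b>/<i> handling mentioned above, here is a rough sketch of how a streaming handler might track run properties and wrap run text in tags. It is an assumption-laden illustration, not the actual Tika handler: `RunFormattingSketch` is a hypothetical class, it buffers text per run, ignores styles and hyperlinks, and does no HTML escaping.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class.
class RunFormattingSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** A toggle such as w:b is on unless w:val explicitly turns it off. */
    static boolean isOn(Attributes atts) {
        String val = atts.getValue(W_NS, "val");
        return val == null
                || !(val.equals("0") || val.equals("false") || val.equals("off"));
    }

    static String toHtml(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
            private boolean inRPr, inText, bold, italic;
            private final StringBuilder run = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (!W_NS.equals(uri)) return;
                switch (localName) {
                    case "p": out.append("<p>"); break;
                    case "r": bold = false; italic = false; run.setLength(0); break;
                    case "rPr": inRPr = true; break;
                    case "b": if (inRPr) bold = isOn(atts); break;
                    case "i": if (inRPr) italic = isOn(atts); break;
                    case "t": inText = true; break;
                    default: break;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) run.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) return;
                switch (localName) {
                    case "rPr": inRPr = false; break;
                    case "t": inText = false; break;
                    case "r": // emit the buffered run with properly nested tags
                        if (bold) out.append("<b>");
                        if (italic) out.append("<i>");
                        out.append(run);
                        if (italic) out.append("</i>");
                        if (bold) out.append("</b>");
                        break;
                    case "p": out.append("</p>"); break;
                    default: break;
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\">"
                + "<w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>x</w:t></w:r>"
                + "<w:r><w:t>y</w:t></w:r>"
                + "<w:r><w:rPr><w:b w:val=\"0\"/></w:rPr><w:t>z</w:t></w:r>"
                + "</w:p>";
        System.out.println(toHtml(xml));
    }
}
```

Closing tags in the reverse order of opening keeps the output well-formed even when both properties are set on the same run.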


          People

          • Assignee: Unassigned
          • Reporter: tallison@mitre.org Tim Allison