TIKA-1321: Add experimental SAX/Streaming XWPF/docx extractor

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: parser
    • Labels:
      None

      Description

      I'd like to contribute an experimental streaming extractor for docx. I should have something ready for committing in a few weeks. I'll attach drafts as they're ready.

      At least for a couple of releases, I'd like to keep it in o.a.t.parser.microsoft.ooxml.experimental if that makes sense.

          Activity

          tpalsulich Tyler Palsulich added a comment -

          Did you ever get a chance to build this, Tim Allison?

          tallison@mitre.org Tim Allison added a comment -

          Yes, but it isn't ready for prime time/committing yet. It's still on my to-do list.

          tallison@mitre.org Tim Allison added a comment - edited

          In a few weeks...well, maybe not.

          The current dev implementation doesn't handle everything that our current extractor does, but it does handle some things our current implementation doesn't.

          The dev implementation uses beans for all parts that aren't document.xml or the glossary-document, but then SAX for the document and glossary document.
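
To illustrate the SAX half of that split, here is a minimal sketch using only the JDK. It is not the code committed for this issue. The class name `DocxSaxSketch` and the method `extractText` are hypothetical. It streams `word/document.xml` out of the docx zip container and collects the text of `w:t` elements, adding a newline at the end of each `w:p` paragraph. It ignores everything else a real extractor must handle, such as tables, hyperlinks, headers and footers, and embedded parts.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of Tika or POI.
class DocxSaxSketch {

    // Namespace of the main WordprocessingML elements (w:p, w:r, w:t).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Collects run text as SAX events arrive; no document tree is built.
    static class TextHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private boolean inText = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                inText = false;
            } else if ("p".equals(localName)) {
                out.append('\n'); // one line per paragraph
            }
        }
    }

    // Reads the docx zip sequentially and SAX-parses only word/document.xml.
    static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        TextHandler handler = new TextHandler();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    // Shield the zip stream from being closed by the parser.
                    InputStream entry = new FilterInputStream(zip) {
                        @Override
                        public void close() {
                        }
                    };
                    factory.newSAXParser().parse(entry, handler);
                    break;
                }
            }
        }
        return handler.out.toString();
    }
}
```

Because the handler only keeps a text buffer and a flag, memory use does not grow with the size of an in-memory object model for the document part, which is the motivation for trying SAX on the largest part of the file.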

          Wall clock sequential tests for our test suite's docx files (100 iterations):
          Current: 25 seconds
          Proposed: 16 seconds

          Once we add "War and Peace" to our test suite's docx files (10 iterations):
          Current: 89 seconds
          Proposed: 15 seconds

          These initial benchmarks suggest that a SAX/read-only docx extractor might be worth the effort.
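
For readers who want to run a comparable measurement, the following is a rough sketch of a sequential wall-clock timer. It is not the harness used for the numbers above. `WallClockSketch`, `timeSequential`, and `speedup` are hypothetical names, and the sketch does no JIT warm-up or statistical treatment, so treat its results as coarse.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the harness used in this issue.
class WallClockSketch {

    // Runs the task sequentially and returns total elapsed wall-clock milliseconds.
    static long timeSequential(Runnable task, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            task.run();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Ratio of baseline time to candidate time; above 1.0 means the candidate is faster.
    static double speedup(double baselineSeconds, double candidateSeconds) {
        return baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        // Figures reported in the comment above.
        System.out.printf("test suite docx files: %.2fx%n", speedup(25, 16));
        System.out.printf("with War and Peace:    %.2fx%n", speedup(89, 15));
    }
}
```

Applied to the reported figures, the ratio is about 1.6x for the test suite's docx files and about 5.9x once the large document is included.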

          tallison@mitre.org Tim Allison added a comment -

          It would be nice to unify these parsers to the degree possible.

          tallison@mitre.org Tim Allison added a comment -

          Initial parser is added. We may want to move some bits into POI. More work remains.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1147 (See https://builds.apache.org/job/Tika-trunk/1147/)
          TIKA-1321 – add SAX based docx parser and integrate it with the recent (tallison: rev d19e4725ff0549597f9156bb0c1e7759f6ce08d9)

          • (add) tika-parsers/src/test/resources/test-documents/testWORD_2006ml.docx
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParser.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BinaryDataHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/AbstractPartHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/ExtendedPropertiesHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BinaryDataHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/MSOfficeParserConfig.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/PartHandler.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (delete) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParserTest.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/RelationshipsHandler.java
          • (edit) CHANGES.txt
          • (edit) tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • (edit) tika-parsers/src/test/resources/test-documents/testWORD_2006ml.xml
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsManager.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFTikaBodyPartHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/RelationshipsHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/BodyPartHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/RelationshipsManager.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Relationship.java
          • (edit) tika-parsers/src/test/resources/test-documents/testWORD_2003ml.xml
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/PartHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/CorePropertiesHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFDocumentXMLBodyHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/BodyContentHandler.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractOfficeParser.java
          • (edit) tika-core/src/main/java/org/apache/tika/utils/DateUtils.java
          • (delete) tika-parsers/src/test/resources/test-documents/testWORD_2006ml_src.docx
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/SXWPFExtractorTest.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Relationship.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParserConfig.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ExtendedPropertiesHandler.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFRunProperties.java
          • (delete) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/Word2006MLParser.java
          • (add) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/CorePropertiesHandler.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java
          • (add) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/ml2006/Word2006MLDocHandler.java

            People

            • Assignee:
              tallison@mitre.org Tim Allison
              Reporter:
              tallison@mitre.org Tim Allison
            • Votes:
              0
              Watchers:
              3