In a few weeks...well, maybe not.
The dev implementation doesn't yet handle everything that our current extractor does, but it does handle some things the current extractor doesn't.
It uses beans for every part other than document.xml and the glossary document, and SAX for those two.
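To make the SAX half of that split concrete, here is a minimal sketch of a read-only, streaming pass over the main document part. It is not the dev implementation: the class and method names (SaxDocxSketch, extractFromDocx, extractFromDocumentXml) are made up for illustration, it assumes the main part lives at word/document.xml (strictly, the package relationships decide that), and it only collects w:t text and paragraph breaks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stream text out of a docx main part with SAX.
final class SaxDocxSketch {
    // WordprocessingML main namespace (the usual "w:" prefix).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Collects the text of w:t elements and appends a newline at the end of each w:p.
    static final class TextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private boolean inText;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                inText = false;
            } else if ("p".equals(localName)) {
                out.append('\n');
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                out.append(ch, start, length);
            }
        }

        String text() {
            return out.toString();
        }
    }

    // Parses one WordprocessingML stream (e.g. the contents of document.xml).
    static String extractFromDocumentXml(InputStream xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so uri/localName are populated
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(xml, handler);
            return handler.text();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document part", e);
        }
    }

    // Scans the zip for the assumed main part and streams it through SAX.
    // Returns an empty string if no such entry exists.
    static String extractFromDocx(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return extractFromDocumentXml(zip);
                }
            }
            return "";
        } catch (IOException e) {
            throw new IllegalStateException("Could not read docx package", e);
        }
    }
}
```

Nothing here builds an in-memory object model of the document body, so memory use should stay roughly flat as the document grows. That is the property the large-file benchmark below is probing.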
Wall-clock time for sequential runs over our test suite's docx files (100 iterations):
Current: 25 seconds
Proposed: 16 seconds
With "War and Peace" added to our test suite's docx files (10 iterations):
Current: 89 seconds
Proposed: 15 seconds
These initial benchmarks suggest that a SAX/read-only docx extractor might be worth the effort.
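As a rough reading of the figures above, the ratios can be worked out with a couple of one-liners. This sketch assumes the reported seconds are totals across all iterations of a run; the class and method names are illustrative only.

```java
// Hypothetical helper for comparing the benchmark figures.
final class BenchmarkRatios {
    // Average seconds for one pass over the file set.
    static double perIteration(double totalSeconds, int iterations) {
        return totalSeconds / iterations;
    }

    // How many times faster the proposed extractor is than the current one.
    static double speedup(double currentSeconds, double proposedSeconds) {
        return currentSeconds / proposedSeconds;
    }
}
```

Under that assumption, speedup(25, 16) is about 1.56 for the existing test files and speedup(89, 15) is about 5.93 once "War and Peace" is included. Per pass, that is 0.25 s versus 0.16 s for the first set and 8.9 s versus 1.5 s for the second, so the gap widens considerably with a large document.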