I committed a reasonable first pass at this.
Still left on the list for further work on other tickets (so that I don't forget):
1. Convert to a double pass: read the extra stuff first, then parse the main document.xml. The list info comes after the document, and on a single pass, that and several other things fall by the wayside. Hyperlinks happen to work, but that's only because those rels happen to come before document.xml in the test doc.
2. Add macro extraction from the ole.bin
3. Make inline image markup consistent with xwpf
4. Figure out how to handle the chart data
5. Include proper div markings for non-main-document content (footers, headers, etc.)
6. We are skipping the "alternateContent" Fallback in favor of Choice. At least with the chart in the test file, this is not the right choice. Which should we pick?
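To make the double-pass idea in the first item concrete, here is a minimal sketch and not the committed code. The class and method names are hypothetical, the whole package is assumed to fit in memory as a byte array, and the main part is assumed to live at word/document.xml (a real package would locate it through the package relationships).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch class; not part of Tika or POI.
class TwoPassDocxSketch {
    // Assumed location of the main part; real packages declare it in the rels.
    static final String MAIN_PART = "word/document.xml";

    // Pass 1: buffer every part except the main document, keyed by entry name,
    // so rels, numbering, etc. are available no matter where they sit in the zip.
    static Map<String, byte[]> readSupportingParts(byte[] zipBytes) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!entry.isDirectory() && !MAIN_PART.equals(entry.getName())) {
                    parts.put(entry.getName(), drain(zis));
                }
            }
        }
        return parts;
    }

    // Pass 2: reopen the zip and return the main document part, or null if absent.
    static byte[] readMainDocument(byte[] zipBytes) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    return drain(zis);
                }
            }
        }
        return null;
    }

    // Reads the current zip entry to its end; the stream reports -1 at the entry boundary.
    private static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

In a real parser the second pass would presumably feed document.xml straight into the SAX handler instead of buffering it; the buffering here only keeps the sketch short. The point is that the supporting parts are in hand before the main document is parsed, regardless of entry order.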
What this has that our current docx extractor doesn't at the moment:
1) no beans, purely read only <wild_speculation>should have better memory footprint</wild_speculation> (see also
2) ability to choose whether or not to extract deleted text (TIKA-2036)
3) ability to handle glossary document content (TIKA-2163)
4) <wild_speculation>I think this should be immune to the rare Unicode bugs that we've seen with DOM...I need to test this (see
5) <wild_speculation>we're not likely to miss content because we're grabbing <w:t> wherever they are (TIKA-1317 and friends). </wild_speculation>
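As an illustration of items 2 and 5, here is a minimal SAX sketch, not the committed handler. It collects text from <w:t> wherever it appears and optionally from <w:delText> (the element Word uses for tracked deletions). The class name, the extract helper and the includeDeleted flag are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch class; not part of Tika or POI.
class RunTextSketch extends DefaultHandler {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private final boolean includeDeleted;
    private final StringBuilder text = new StringBuilder();
    private boolean capturing;

    RunTextSketch(boolean includeDeleted) {
        this.includeDeleted = includeDeleted;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Capture w:t at any depth; capture w:delText only when asked to.
        if (W_NS.equals(uri)
                && ("t".equals(localName) || (includeDeleted && "delText".equals(localName)))) {
            capturing = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (W_NS.equals(uri) && ("t".equals(localName) || "delText".equals(localName))) {
            capturing = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (capturing) {
            text.append(ch, start, length);
        }
    }

    // Parses the given XML string and returns the collected run text.
    static String extract(String xml, boolean includeDeleted) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        RunTextSketch handler = new RunTextSketch(includeDeleted);
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text.toString();
    }
}
```

Because the handler keys on the element name alone and ignores the surrounding structure, text nested in containers the parser does not otherwise understand is still picked up, which is the behavior item 5 speculates about.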
On the downside, this re-invents several helper classes from POI and Tika, which I really, really regret.
1. Nick Burch and fellow devs, how does this commit look? Anything crazy that ought to be fixed, including the mime-type?
2. Is there any way to move most of this into POI? The current OPCPackage and the rest of the code appear to be tightly tied to ZipPackage and beans. I could add this stuff as a standalone streaming/readonly xwpf set of objects, but do we want that in POI?
3. What do you think of converting our current docx processing to these classes? I don't think it would take much rework to pull the related bits from the zip and then process document.xml as we're currently doing.