I noticed that when we parse an embedded Office document, it's
inefficient because we take the NPOIFileSystem we had already parsed
(from the full document) and write the "sub-directory" containing the
embedded document to a temp file, only to re-parse it once we've
recursed to the inner detector/parser.
I worked out a patch that instead passes the sub-directory
of the embedded document directly to the inner detector/parser.
This gives a good speedup in my test case: I have a private test set
of 2,080 Word docs; parsing them (and their embedded docs) takes 16.1
sec on trunk and 10.7 sec with this patch, 34% faster (best of 10).
The change has a few parts:
- Fixed all Office parsers to also accept the document root
(DirectoryNode) directly; this was straightforward (but
touched a lot of sources) because internally these parsers were
extracting that root anyway.
- Fixed AbstractPOIFSExtractor to not do the serialization to a temp
file and instead put the document's root on an otherwise empty
(new byte[0]) TikaInputStream as the openContainer.
- Fixed OfficeParser and POIFSContainerDetector to recognize a
DirectoryNode on the incoming TikaInputStream, and parse/detect
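
The hand-off described in the list above can be sketched roughly as follows. This is a simplified model, not the patch itself: ParsedDirectory, ContainerAwareStream and EmbeddedHandoffSketch are hypothetical stand-ins for POI's DirectoryNode, Tika's TikaInputStream and the Office parser/detector, and only illustrate the idea of attaching an already-parsed container to an empty stream and checking for it on the receiving side.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for an already-parsed directory (not POI's DirectoryNode).
final class ParsedDirectory {
    final String name;

    ParsedDirectory(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for a stream that can carry an open container,
// loosely modelled on the "openContainer" idea mentioned in the text.
final class ContainerAwareStream extends ByteArrayInputStream {
    private Object openContainer;

    ContainerAwareStream(byte[] data) {
        super(data);
    }

    Object getOpenContainer() {
        return openContainer;
    }

    void setOpenContainer(Object container) {
        this.openContainer = container;
    }
}

final class EmbeddedHandoffSketch {
    // Outer extractor side: no temp file, just an empty stream carrying the directory.
    static ContainerAwareStream wrapEmbedded(ParsedDirectory dir) {
        ContainerAwareStream stream = new ContainerAwareStream(new byte[0]);
        stream.setOpenContainer(dir);
        return stream;
    }

    // Inner parser side: reuse the attached directory if present, otherwise read bytes.
    static String parse(InputStream in) {
        if (in instanceof ContainerAwareStream) {
            Object container = ((ContainerAwareStream) in).getOpenContainer();
            if (container instanceof ParsedDirectory) {
                return "reused:" + ((ParsedDirectory) container).name;
            }
        }
        try {
            byte[] bytes = in.readAllBytes(); // Java 9+; models the slow re-parse path
            return "reparsed:" + bytes.length + " bytes";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this model, parse(wrapEmbedded(dir)) takes the fast path and never touches the byte content, while a plain InputStream still falls back to reading and parsing the bytes.
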
The one catch I hit was a failure in POIContainerExtractionTest, due
to already-fixed bug 51949 in POI (NPE on double-close of
ZipFileZipEntrySource); I added a workaround in
ParsingEmbeddedDocumentExtractor for this, with a TODO to remove the
workaround once POI releases and we upgrade. It's important to remove
that because we are double-opening the ZIP archive now for embedded
I also converted a couple of if/else string-equality chains into HashMap
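
As a rough illustration of that kind of conversion (not the actual code touched by the patch), the sketch below shows an if/else chain of String.equals calls next to an equivalent HashMap lookup. The class name, the particular entry names and the fallback type are assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the keys, values and fallback are assumptions for this example.
final class TypeLookupSketch {
    private static final String FALLBACK = "application/octet-stream";

    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("WordDocument", "application/msword");
        TYPES.put("Workbook", "application/vnd.ms-excel");
        TYPES.put("PowerPoint Document", "application/vnd.ms-powerpoint");
    }

    // Before: one equals() comparison per candidate, in order.
    static String viaChain(String entryName) {
        if ("WordDocument".equals(entryName)) {
            return "application/msword";
        } else if ("Workbook".equals(entryName)) {
            return "application/vnd.ms-excel";
        } else if ("PowerPoint Document".equals(entryName)) {
            return "application/vnd.ms-powerpoint";
        } else {
            return FALLBACK;
        }
    }

    // After: a single hash lookup, with the mapping kept in one table.
    static String viaMap(String entryName) {
        return TYPES.getOrDefault(entryName, FALLBACK);
    }
}
```

Both methods return the same result for every input; the map version keeps the mapping in one place and avoids walking the chain for each call.
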