Here's a trivial test document plus a test case showing the issue: if you
run TikaCLI -z on it, the embedded file is extracted as _1402837031.wps,
but it really should be a PDF.
I traced this down a bit, into AbstractPOIFSExtractor, where it calls
POIFSDocumentType.detectType(dir) and that (incorrectly) returns WPS.
I think the logic in POIFSContainerDetector.detect (which guesses the
embedded file's type by looking at the directory listing of the
document node) is too simplistic. We may need to peek into the
\0001CompObj contents to get the true document type (using POI's
POIFSViewer, I can see that this entry seems to identify the MediaType
of the file, and processStarDrawOrImpress already does so...).
But I don't know the format of the bytes in \0001CompObj.
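
For what it's worth, here is a minimal sketch of reading the first string out of a \0001CompObj stream. It assumes the layout described in Microsoft's [MS-OLEDS] specification (a 28-byte header followed by a length-prefixed, null-terminated ANSI "user type" string); I have not verified that against this test document. The class and method names are hypothetical and are not part of Tika or POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not an existing Tika or POI class. */
final class CompObjSketch {
    // Assumption from [MS-OLEDS]: 4 reserved bytes + 4 version bytes + 20 reserved bytes.
    private static final int HEADER_LENGTH = 28;

    /**
     * Returns the ANSI user type string that follows the header,
     * or null if the bytes do not fit the assumed layout.
     */
    static String readAnsiUserType(byte[] compObj) {
        if (compObj == null || compObj.length < HEADER_LENGTH + 4) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(compObj).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(HEADER_LENGTH);
        // 4-byte little-endian length, assumed to include the trailing null.
        int length = buf.getInt();
        if (length <= 0 || length > buf.remaining()) {
            return null;
        }
        byte[] raw = new byte[length];
        buf.get(raw);
        // Drop trailing null terminator(s) before decoding.
        int end = length;
        while (end > 0 && raw[end - 1] == 0) {
            end--;
        }
        return new String(raw, 0, end, StandardCharsets.ISO_8859_1);
    }
}
```

If the layout assumption holds, the returned string would still need to be mapped to a media type somewhere, and I don't know which values to expect for an embedded PDF.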
Alternatively, we could pull the CONTENTS bytes and auto-detect on
those. Basically, we somehow need to determine whether it's another
office format (and, if so, do what we do now); otherwise, pull the
CONTENTS bytes and recurse on only those.
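
A rough sketch of that second idea, sniffing the leading bytes of CONTENTS to decide what to do with it. This is only an illustration: the class, enum, and method names are hypothetical, only two signatures are checked (the OLE2 compound-document header and the "%PDF-" marker), and in Tika itself the recursion would presumably go through the existing auto-detection rather than a hand-rolled check like this.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not an existing Tika or POI class. */
final class EmbeddedContentsSniffer {
    enum Kind { OLE2, PDF, UNKNOWN }

    // Signature at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // PDF files begin with "%PDF-" followed by the version.
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies the raw bytes of a CONTENTS entry by their leading signature. */
    static Kind sniff(byte[] contents) {
        if (startsWith(contents, OLE2_MAGIC)) {
            return Kind.OLE2;     // another office format: keep the current handling
        }
        if (startsWith(contents, PDF_MAGIC)) {
            return Kind.PDF;      // recurse on the CONTENTS bytes only
        }
        return Kind.UNKNOWN;      // leave the decision to full auto-detection
    }
}
```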