Can you also check that parsing styles.xml of e.g. writer or calc documents does no harm?
Good idea, Uwe! I tested this.
On a fresh Writer (.odt) doc, no text comes out of the styles.xml
(good). If I then edit the footer, Tika misses that text today, but
the patch gets it (I added a test).
On a fresh Calc (.ods) doc, there is some minor "placeholder" text:
<p>Page / 99 </p>
I've fixed the "99" by also filtering for "text:page-count" in ODCP;
the date/time is apparently when the doc was created; I think the rest
of the boilerplate text is acceptable? E.g., you can see this text
(Page 1) when you do Page Preview or print...
When I then edited the footer in the Calc doc, Tika misses that text
today, but the patch gets it (I added a test for this too).
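
To illustrate the kind of filtering mentioned above for "text:page-count", here is a minimal sketch. It is not the code from the patch: the class name SkipFieldsSketch and the extract helper are made up for this example, and it uses only the JDK's SAX parser rather than Tika's own handlers. It drops the character data of any element in the ODF text namespace whose local name is in a skip list.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the ODCP implementation.
class SkipFieldsSketch extends DefaultHandler {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    private final Set<String> skipped;
    private final StringBuilder out = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a skipped element

    SkipFieldsSketch(String... skippedLocalNames) {
        this.skipped = new HashSet<>(Arrays.asList(skippedLocalNames));
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Start skipping at a listed element, and keep counting nested elements.
        if (skipDepth > 0 || (TEXT_NS.equals(uri) && skipped.contains(local))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is outside every skipped element.
        if (skipDepth == 0) {
            out.append(ch, start, length);
        }
    }

    /** Parses the XML string and returns its text with the listed fields removed. */
    static String extract(String xml, String... skippedLocalNames) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SkipFieldsSketch handler = new SkipFieldsSketch(skippedLocalNames);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }
}
```

With a footer paragraph that contains both a text:page-number and a text:page-count field, skipping only "page-number" still lets the count ("99") through, while skipping both leaves just the surrounding "Page" and "/" text.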
About the order: I have it somewhere in the back of my head that the order of files in the ZIP file is somehow part of the standard. At least I know that the mimetype file must be the first one in the ZIP file, to make detection of the format easy.
I haven't been able to find mention of this in the spec... I'm looking
at http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-v1.1.odt and
it just describes the general ZIP format as far as I can tell...
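
Whatever the spec says, the ordering Uwe describes is easy to check on a given file. Below is a minimal sketch using only java.util.zip; the class name FirstZipEntryCheck and the in-memory test archive are made up for this example, and the assumption that the first entry is named "mimetype" is taken from the comment above, not confirmed against the spec here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Tika.
class FirstZipEntryCheck {

    /** Returns the name of the first entry in the ZIP stream, or null if there is none. */
    static String firstEntryName(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry first = zip.getNextEntry();
            return first == null ? null : first.getName();
        }
    }

    /** Builds a small in-memory ZIP whose entries are written in the given order. */
    static byte[] zipOf(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real .odt/.ods file, with entries in the assumed order.
        byte[] odfLike = zipOf("mimetype", "content.xml", "styles.xml", "meta.xml");
        System.out.println(firstEntryName(new ByteArrayInputStream(odfLike))); // prints "mimetype"
    }
}
```

Pointing firstEntryName at a FileInputStream for a real document would show whether a given producer actually writes the entries in that order.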
I still don't get the reason for problems with metadata if the order of files is different.
Oh, this is because XHTMLContentHandler, on seeing the end of header /
start of body will output <meta> tags for all metadata present in the
Metadata class at that time. So... if new entries are added to
Metadata after the body tag is started they won't make it into the
<head>...</head>. Looks like this was done under