Tika / TIKA-736

OpenOffice parser: master footer text isn't extracted

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      If I edit the footer text on the master slide of an OpenOffice presentation, I see that text rendered on the slide, but it's not extracted by Tika.

      Digging into the document, curiously the footer text is in the styles.xml, under office:master-styles -> style:master-page -> draw:frame -> draw:text-box -> text. I think somehow we're not linking up each slide's master text elements to that slide, similar to TIKA-712.
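
To make the location concrete, here is a minimal sketch (not Tika code) that pulls the character data found under office:master-styles out of an ODF package's styles.xml. The class and method names are hypothetical, the entry is buffered in memory for simplicity, and the helper that builds a one-entry package exists only so the example can run without a real .odp file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class MasterTextDemo {
    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";

    /** Returns the character data found inside office:master-styles of styles.xml. */
    static String masterText(InputStream odfPackage) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // > 0 while inside office:master-styles

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (depth > 0 || (OFFICE_NS.equals(uri) && "master-styles".equals(local))) {
                    depth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (depth > 0) {
                    depth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) {
                    out.append(ch, start, length);
                }
            }
        };
        try (ZipInputStream zip = new ZipInputStream(odfPackage)) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if ("styles.xml".equals(entry.getName())) {
                    // Buffer the entry so the SAX parser cannot close the ZIP stream.
                    byte[] xml = zip.readAllBytes();
                    factory.newSAXParser().parse(new ByteArrayInputStream(xml), handler);
                }
            }
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("could not read package", e);
        }
        return out.toString().trim();
    }

    /** Hypothetical fixture: builds an in-memory package holding only styles.xml. */
    static byte[] samplePackage(String stylesXml) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buf)) {
            zip.putNextEntry(new ZipEntry("styles.xml"));
            zip.write(stylesXml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<office:document-styles xmlns:office='" + OFFICE_NS + "'>"
                + "<office:master-styles><p>Master footer</p></office:master-styles>"
                + "</office:document-styles>";
        System.out.println(masterText(new ByteArrayInputStream(samplePackage(xml))));
    }
}
```

Text outside office:master-styles (for example under office:styles) is ignored by this sketch, which is why the depth counter only starts at that element.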

      1. testMasterFooter.odp
        14 kB
        Michael McCandless
      2. TIKA-736.patch
        10 kB
        Michael McCandless
      3. TIKA-736.patch
        2 kB
        Michael McCandless

          Activity

          Michael McCandless added a comment -

          Patch with failing test case.

          Nick Burch added a comment -

          It's probably not worth putting too much work into our OpenOffice parser at the moment. As soon as the ODFToolkit podling does their first incubating release, we'll switch to using a parser based on that.

          Michael McCandless added a comment -

          OK that makes sense; hopefully it's not too long...

          Nick Burch added a comment -

          Looking at our current parser, we don't touch the styles part, only meta and content. I think that rather than trying to add this in, we're best off waiting (+helping!) for the first ODF Toolkit incubating release, then switching to a full-featured parser, much as we have for the POI-powered ones.

          Uwe Schindler added a comment -

          The current ODF parser is very lightweight and memory efficient (I hope ODFToolkit uses a streaming API, too, comparable with SAX). It is very elegant, but limited. I would be against a parser like the OpenXML one that builds huge DOM/object trees, as this is also something of a security hole: if your parser gets a huge document that doesn't fit in memory, it crashes your app.

          The current parser streams the document XMLs through the SAX API and converts it to HTML by replacing element names and doing some structural modifications (I wrote that one a few years ago and donated it to TIKA). I have no problem with nuking it once ODFToolkit is out, but please, please, please use a streaming API without large DOM/Object trees and temporary files. Optionally leave both parsers available (I would also take care of the current one).
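
As a rough illustration of the streaming approach described here (not the actual Tika parser), the sketch below feeds ODF XML through SAX and writes HTML by renaming elements on the fly, without building a tree. The element mapping is a deliberately tiny, hypothetical one, and the class name is made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class OdfToHtmlDemo {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Hypothetical mapping from text:* element names to HTML tags.
    static final Map<String, String> TAGS =
            Map.of("p", "p", "h", "h1", "list", "ul", "list-item", "li");

    static String htmlTag(String uri, String local) {
        return TEXT_NS.equals(uri) ? TAGS.get(local) : null;
    }

    /** Streams the XML once; unmapped elements are dropped but their text is kept. */
    static String toHtml(String odfXml) {
        StringBuilder html = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                String tag = htmlTag(uri, local);
                if (tag != null) {
                    html.append('<').append(tag).append('>');
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String tag = htmlTag(uri, local);
                if (tag != null) {
                    html.append("</").append(tag).append('>');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Escape per character so chunked delivery does not matter.
                for (int i = start; i < start + length; i++) {
                    switch (ch[i]) {
                        case '&': html.append("&amp;"); break;
                        case '<': html.append("&lt;"); break;
                        case '>': html.append("&gt;"); break;
                        default: html.append(ch[i]);
                    }
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(odfXml)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("<text:p xmlns:text='" + TEXT_NS + "'>Hello</text:p>"));
    }
}
```

Memory use in a handler like this is bounded by the output buffer rather than by the size of the document tree, which is the property being argued for above.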

          Nick Burch added a comment -

          Uwe - this might be best discussed on the "Tika is waiting for ODFToolkit to improve ODF file format processing" thread on the ODF Toolkit dev list
          <http://mail-archives.apache.org/mod_mbox/incubator-odf-dev/201110.mbox/%3CCAFJd6yTAQ=gR=Q_TGqYOrhCYjZ7EmfMzuvUXfD0ipGwAUtNh3Q@mail.gmail.com%3E> - that's where you'll find people who know what ODF Toolkit can and can't do and offer!

          Michael McCandless added a comment -

          This turned out to be fairly simple to fix, so I worked out a patch,
          and I think it's worth fixing in our current ODF parser, since we're
          not sure when we'll cut over to the ODFToolkit-based solution.

          Basically I also recurse into styles.xml, using the content parser. It
          doesn't seem to have the same "problem" as PPT/PPTX (TIKA-712), where
          we get the boilerplate text out, except in one case that I could find
          (page numbers would output <number> placeholder text), so I fixed
          OpenDocumentContentParser to not output text for text:page-number
          elements. (Separately, I noticed we don't properly extract page
          numbers for ODP files today... I'll open a new issue.)

          I also noticed that, because the OpenDocumentParser is strictly streaming
          (single pass through the ZIP file), we can easily fail to insert the
          meta tags into the output XHTML if we encounter "meta.xml" after
          "content.xml". This is maybe not so bad, because the metadata will
          still have the fields... but we could fix it by using a random-access
          ZipFile instead if we had already opened a ZipFile (e.g.
          AutoDetectParser), or if the InputStream is a TikaInputStream with a
          File. I put a TODO to do this...

          Also, I moved up the XHTMLContentHandler wrapping into
          OpenDocumentParser (from OpenDocumentContentParser), so that we don't
          emit head/body tags twice. I think we need to do this for
          TIKA-735 too.

          This fix is not perfect, since (just like TIKA-712, for ppt/pptx) it
          outputs the master text only once (as if it were its own slide),
          instead of inlining it into each slide that referenced that master,
          but I think it's at least better than what we have today (no master
          text is extracted)... progress not perfection.
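
The page-number filtering mentioned in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: the class name and the set of skipped element names are hypothetical, and only text:page-number is included because that is the element named above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PageFieldFilterDemo {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Hypothetical skip list; further field elements could be added here.
    static final Set<String> SKIPPED = Set.of("page-number");

    /** Extracts character data, dropping anything nested inside a skipped field element. */
    static String extractText(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int skipDepth = 0; // > 0 while inside a skipped element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (skipDepth > 0 || (TEXT_NS.equals(uri) && SKIPPED.contains(local))) {
                    skipDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (skipDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<text:p xmlns:text='" + TEXT_NS + "'>Slide "
                + "<text:page-number>&lt;number&gt;</text:page-number> footer</text:p>";
        System.out.println(extractText(xml));
    }
}
```

The placeholder inside the field element is dropped while the surrounding footer text is kept.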

          Uwe Schindler added a comment -

          Hi Michael,

          thanks for this simple improvement. Can you also check that parsing styles.xml of e.g. writer or calc documents does no harm?

          About the order: I have it somewhere in the back of my head that the order of files in the ZIP file is somehow part of the standard. At least I know that the mimetype file must be the first one in the ZIP file, to make detection of the format easy. As far as I remember there was also a requirement that meta.xml must come before content.xml. Unfortunately I am not able to download the ODF spec and verify this; maybe you have a copy that mentions this.

          I still don't get the reason for problems with metadata if the order of files is different. The metadata is parsed to another structure and not the HTMLContentHandler, so where is the problem if content comes first? The Metadata object should be filled in all cases once the parsing process is finished.

          Michael McCandless added a comment -

          Can you also check that parsing styles.xml of e.g. writer or calc documents does no harm?

          Good idea, Uwe! I tested this.

          On a fresh Writer (.odt) doc, no text comes out of the styles.xml
          (good). If I then edit the footer, Tika misses that text today, but
          the patch gets it (I added a test).

          On a fresh Calc (.ods) doc, there is some minor "placeholder" text:

          <pre>
          <p>???</p>
          <p>Page</p>
          <p>??? (???)</p>
          <p>10/26/2011, 11:13:57</p>
          <p>Page / 99 </p>
          </pre>

          I've fixed the "99" by also filtering for "text:page-count" in ODCP;
          the date/time is apparently when the doc was created; I think the rest
          of the boilerplate text is acceptable? E.g., you can see this text
          (Page 1) when you do Page Preview or print...

          When I then edited the footer in the Calc doc, Tika misses that text
          today, but the patch gets it (I added a test for this too).

          About the order: I have it somewhere in the back of my head that the order of files in the ZIP file is somehow part of the standard. At least I know that the mimetype file must be the first one in the ZIP file, to make detection of the format easy.

          I haven't been able to find mention of this in the spec... I'm looking
          at http://docs.oasis-open.org/office/v1.1/OS/OpenDocument-v1.1.odt and
          it just describes the general ZIP format as far as I can tell...

          I still don't get the reason for problems with metadata if the order of files is different.

          Oh, this is because XHTMLContentHandler, on seeing the end of header /
          start of body will output <meta> tags for all metadata present in the
          Metadata class at that time. So... if new entries are added to
          Metadata after the body tag is started they won't make it into the
          <head>...</head>. Looks like this was done under TIKA-478.
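
The ordering effect described here can be reproduced with a toy model. The writer below is a hypothetical stand-in, not Tika's XHTMLContentHandler: it copies whatever metadata exists into the head at the moment the body starts, so entries added afterwards never reach the output. The metadata key and values are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HeadSnapshotDemo {
    /** Toy XHTML writer: metadata is copied into <head> when the body starts. */
    static final class ToyXhtmlWriter {
        private final Map<String, String> metadata;
        private final StringBuilder out = new StringBuilder();
        private boolean bodyStarted = false;

        ToyXhtmlWriter(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        private void startBodyIfNeeded() {
            if (bodyStarted) {
                return;
            }
            out.append("<head>");
            // Snapshot: only metadata present right now is written.
            metadata.forEach((k, v) -> out.append("<meta name=\"").append(k)
                    .append("\" content=\"").append(v).append("\"/>"));
            out.append("</head><body>");
            bodyStarted = true;
        }

        void text(String s) {
            startBodyIfNeeded();
            out.append("<p>").append(s).append("</p>");
        }

        String finish() {
            startBodyIfNeeded();
            return out.append("</body>").toString();
        }
    }

    /** Simulates a single pass over package entries in the given order. */
    static String render(String... entryOrder) {
        Map<String, String> metadata = new LinkedHashMap<>();
        ToyXhtmlWriter writer = new ToyXhtmlWriter(metadata);
        for (String name : entryOrder) {
            if ("meta.xml".equals(name)) {
                metadata.put("title", "Demo");   // metadata becomes known
            } else if ("content.xml".equals(name)) {
                writer.text("slide text");       // body output begins
            }
        }
        return writer.finish();
    }

    public static void main(String[] args) {
        System.out.println(render("meta.xml", "content.xml"));
        System.out.println(render("content.xml", "meta.xml"));
    }
}
```

In the second call the metadata map still ends up with the title, but the head was already written, which mirrors the point that the Metadata object is complete while the XHTML head is not.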

          Uwe Schindler added a comment -

          Oh, this is because XHTMLContentHandler, on seeing the end of header /
          start of body will output <meta> tags for all metadata present in the
          Metadata class at that time. So... if new entries are added to
          Metadata after the body tag is started they won't make it into the
          <head>...</head>. Looks like this was done under TIKA-478.

          Oh, that was long after my initial submission of this parser. I was not aware that the metadata is now also replicated into the HTML head, in addition to the separate Metadata class.

          With the current parser it can also happen that the footer/header/master slide text comes before or after the main text, depending on the order of files. But for indexing purposes, as with Lucene, it's not an issue at all. Indexing was the only reason the original version of this parser was created (as always, for PANGAEA), so order did not have any effect.

          We could work around the whole thing without the need for a random-access ZIP file if we could serialize the body and insert it later (e.g. using a caching SAX filter)? In general the text-only part is much smaller than a ZIP file with large 1000dpi images, so caching it might not be an issue (of course not the whole DOM tree).
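
One way to picture the caching idea is a handler that records body SAX events and replays them once the metadata is known. The sketch below is an assumption-laden illustration, not an existing Tika class: all names are hypothetical, and it buffers a flat list of events rather than a DOM tree.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class BufferedBodyDemo {
    /** One recorded SAX event that can be replayed against another handler. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    /** Records element and character events instead of writing them out. */
    static final class EventBuffer extends DefaultHandler {
        private final List<Event> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            Attributes copy = new AttributesImpl(atts); // SAX may reuse the original
            events.add(t -> t.startElement(uri, local, qName, copy));
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add(t -> t.endElement(uri, local, qName));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length); // copy; the array is reused
            events.add(t -> t.characters(text.toCharArray(), 0, text.length()));
        }

        void replay(ContentHandler target) throws SAXException {
            for (Event e : events) {
                e.replay(target);
            }
        }
    }

    /** Simulates content arriving before metadata, then writes head followed by body. */
    static String demo() {
        EventBuffer body = new EventBuffer();
        // content.xml is seen first: body events are recorded, not written.
        body.startElement("", "p", "p", new AttributesImpl());
        char[] text = "slide text".toCharArray();
        body.characters(text, 0, text.length);
        body.endElement("", "p", "p");

        // meta.xml is seen later; the head can now be written with full metadata.
        String title = "Demo";
        StringBuilder out = new StringBuilder("<head><title>")
                .append(title).append("</title></head><body>");
        DefaultHandler writer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            body.replay(writer);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.append("</body>").toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off is that the buffer grows with the amount of body text, which is the cost weighed in the comment above against using a random-access ZIP file.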


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              1
              Watchers:
              1