Uploaded image for project: 'Tika'
  1. Tika
  2. TIKA-948

Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2
    • Component/s: parser
    • Labels:
      None

      Description

      This is just like TIKA-704, except that issue was for an OOXML Word
      doc but this is for the older Word 97-2003 format.

      1. EmbeddedPDF.doc
        1.42 MB
        Michael McCandless
      2. TIKA-948.patch
        14 kB
        Michael McCandless
      3. TIKA-948.patch
        2 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Here's a trivial test document + test case showing the issue; if you run TikaCLI
        -z on this you'll get an embedded file extracted as _1402837031.wps,
        but that really should be a PDF.

        I traced this down a bit, into AbstractPOIFSExtractor, where it calls
        POIFSDocumentType.detectType(dir) and that (incorrectly) returns WPS.

        I think the logic in POIFSContainerDetector.detect (which guesses the
        embedded file's type by looking at the directory listing of the
        document node) is too simplistic? We may need to peek into the
        \0001CompObj contents to get the true document type (I can see, using
        POI's POIFSViewer that this seems to identify the MediaType of the
        file, and processStarDrawOrImpress already does so...).

        But I don't know the format of the bytes in \0001CompObj.

        Or maybe alternatively ... we can pull the CONTENTS bytes and
        auto-detect on that. Basically we somehow need to determine if it's
        another office format (and do what we now do) else pull the CONTENTS
        bytes and recurse on only that.

        Show
        mikemccand Michael McCandless added a comment - Here's a trivial test document + test case showing the issue; if you run TikaCLI -z on this you'll get an embedded file extracted as _1402837031.wps, but that really should be a PDF. I traced this down a bit, into AbstractPOIFSExtractor, where it calls POIFSDocumentType.detectType(dir) and that (incorrectly) returns WPS. I think the logic in POIFSContainerDetector.detect (which guesses the embedded file's type by looking at the directory listing of the document node) is too simplistic? We may need to peek into the \0001CompObj contents to get the true document type (I can see, using POI's POIFSViewer that this seems to identify the MediaType of the file, and processStarDrawOrImpress already does so...). But I don't know the format of the bytes in \0001CompObj. Or maybe alternatively ... we can pull the CONTENTS bytes and auto-detect on that. Basically we somehow need to determine if it's another office format (and do what we now do) else pull the CONTENTS bytes and recurse on only that.
        Hide
        mikemccand Michael McCandless added a comment -

        Patch... I think it's ready.

        I fixed POIFSContainerDetector to also look for PDF, by loading the
        CompObj bytes if present (I refactored the Star Impress/Draw code that
        was already doing so), and fixed AbstractPOIFSExtractor to handle
        non-office embedded documents (well, only PDF for now).

        It would good to fix all other non-office document types
        that may be embedded ... but I think we can do that in future issue(s).

        Show
        mikemccand Michael McCandless added a comment - Patch... I think it's ready. I fixed POIFSContainerDetector to also look for PDF, by loading the CompObj bytes if present (I refactored the Star Impress/Draw code that was already doing so), and fixed AbstractPOIFSExtractor to handle non-office embedded documents (well, only PDF for now). It would good to fix all other non-office document types that may be embedded ... but I think we can do that in future issue(s).
        Hide
        gagravarr Nick Burch added a comment -

        Should we not just pass the bytes to a Detector if we have a CompObj but nothing else looks promising?

        Show
        gagravarr Nick Burch added a comment - Should we not just pass the bytes to a Detector if we have a CompObj but nothing else looks promising?
        Hide
        alexott Alex Ott added a comment - - edited

        Maybe you also reuse information from prop stream nearby of CompObj?

        P.S. Format of CompObj is described in MS's documentation, as I remember, there are 2 flavors of them, with different offsets of actual data

        Show
        alexott Alex Ott added a comment - - edited Maybe you also reuse information from prop stream nearby of CompObj? P.S. Format of CompObj is described in MS's documentation, as I remember, there are 2 flavors of them, with different offsets of actual data
        Hide
        gagravarr Nick Burch added a comment -

        I've had a go at fixing this in r1358404, using a slightly different method. Michael's suggested fix didn't seem to scale well to additional formats, and would've been a bit confusing for the extractors.

        Instead, I've updated POIFSContainerDetector to report that a CompObj that isn't a known "native CompObj" has been found (older Works file, some older Star files etc are all natively stored). This is reported in a similar way to the OLE10Native stuff, which was slightly tweaked. I've then updated the AbstarctPOIFSExtractor to then pull the data out of the CompObj structure. Tika-App -z is then able to correctly extract the pdf out.

        One remaining issue is with file extensions, a little bit more work is needed to do that cleanly I think

        Show
        gagravarr Nick Burch added a comment - I've had a go at fixing this in r1358404, using a slightly different method. Michael's suggested fix didn't seem to scale well to additional formats, and would've been a bit confusing for the extractors. Instead, I've updated POIFSContainerDetector to report that a CompObj that isn't a known "native CompObj" has been found (older Works file, some older Star files etc are all natively stored). This is reported in a similar way to the OLE10Native stuff, which was slightly tweaked. I've then updated the AbstarctPOIFSExtractor to then pull the data out of the CompObj structure. Tika-App -z is then able to correctly extract the pdf out. One remaining issue is with file extensions, a little bit more work is needed to do that cleanly I think
        Hide
        gagravarr Nick Burch added a comment -

        I think r1358467 should fix the file extension problem in a general way for all (known) file formats. The only thing is we don't currently make use of things like "Adobe Acrobat" in the CompObj to short-cut the detection, it's done in a general way for all formats known or unknown. Not sure if we need to fix this or not though

        Show
        gagravarr Nick Burch added a comment - I think r1358467 should fix the file extension problem in a general way for all (known) file formats. The only thing is we don't currently make use of things like "Adobe Acrobat" in the CompObj to short-cut the detection, it's done in a general way for all formats known or unknown. Not sure if we need to fix this or not though
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks for taking this Nick! Can you add a CHANGES entry for it?

        It seems odd not using the explicitly declared file type inside the CompObj and running detection on the byte[] CONTENTS instead, but it is definitely better than my original patch since it can handle all formats we can detect (not just PDF), so +1. We can separately (later) decode CompObj if that ever seems necessary...

        Also: do we have any test data for that Quill96 (that I almost broke!)? Would be nice to have test coverage for it...

        Show
        mikemccand Michael McCandless added a comment - Thanks for taking this Nick! Can you add a CHANGES entry for it? It seems odd not using the explicitly declared file type inside the CompObj and running detection on the byte[] CONTENTS instead, but it is definitely better than my original patch since it can handle all formats we can detect (not just PDF), so +1. We can separately (later) decode CompObj if that ever seems necessary... Also: do we have any test data for that Quill96 (that I almost broke!)? Would be nice to have test coverage for it...
        Hide
        gagravarr Nick Burch added a comment -

        If someone feels keen, we could add CompObj decoding. When that's there, we could try to extract helpful data (such as file type) out. However, it doesn't look like it stores the file type as such, just the application the data should be passed to, which might well not be specific enough

        The Quill96 thing comes from an old works file, from an old bug report somewhere. IIRC we didn't get a sample file to go with it, but it might be good for someone to work out which mailing list discussion / old issue it's associated with to check...

        Show
        gagravarr Nick Burch added a comment - If someone feels keen, we could add CompObj decoding. When that's there, we could try to extract helpful data (such as file type) out. However, it doesn't look like it stores the file type as such, just the application the data should be passed to, which might well not be specific enough The Quill96 thing comes from an old works file, from an old bug report somewhere. IIRC we didn't get a sample file to go with it, but it might be good for someone to work out which mailing list discussion / old issue it's associated with to check...
        Hide
        mikemccand Michael McCandless added a comment -

        However, it doesn't look like it stores the file type as such, just the application the data should be passed to, which might well not be specific enough

        Ahh, that's a good point...

        The Quill96 thing comes from an old works file, from an old bug report somewhere. IIRC we didn't get a sample file to go with it, but it might be good for someone to work out which mailing list discussion / old issue it's associated with to check...

        Oh in fact it looks like we do have test coverage for this (TestContainerAwareDetector.testDetectOLE2), so all is good.

        Show
        mikemccand Michael McCandless added a comment - However, it doesn't look like it stores the file type as such, just the application the data should be passed to, which might well not be specific enough Ahh, that's a good point... The Quill96 thing comes from an old works file, from an old bug report somewhere. IIRC we didn't get a sample file to go with it, but it might be good for someone to work out which mailing list discussion / old issue it's associated with to check... Oh in fact it looks like we do have test coverage for this (TestContainerAwareDetector.testDetectOLE2), so all is good.
        Hide
        mikemccand Michael McCandless added a comment -

        I think Nick's commits fixed this.

        Show
        mikemccand Michael McCandless added a comment - I think Nick's commits fixed this.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development