Tika
  1. Tika
  2. TIKA-753

Improve performance when parsing embedded Office docs

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None
    1. TIKA-753.patch
      35 kB
      Michael McCandless

      Activity

      Hide
      Michael McCandless added a comment -

      I noticed that when we parse an embedded Office document, it's
      inefficient because we take the NPOIFileSystem we had already parsed
      (from the full document) and write the "sub-directory" containing the
      embedded document to a temp file, only to re-parse it again once we've
      recursed to the inner detector/parser.

      I worked out a patch to instead just directly pass the sub-directory
      of the embedded document directly to the inner detector/parser.

      This gives a good speedup in my test case: I have a private test set
      of 2,080 Word docs; parsing them (and their embedded docs) takes 16.1
      on trunk and 10.7 sec with this patch – 34% faster (best of 10).

      The change has a few parts:

      • Fixed all Office parsers to alternatively directly take the
        document root (DirectoryNode); this was straightforward (but
        touched a lot of sources) because internally these parsers were
        extracting that root anyway.
      • Fixed AbstractPOIFSExtractor to not do the serialization to a temp
        file and instead put the document's root on an otherwise empty
        (new byte[0]) TikaInputStream as the openContainer.
      • Fixed OfficeParser and POIFSContainerDetector to recognize a
        DirectoryNode on the incoming TikaInputStream, and parse/detect
        that directly.

      The one catch I hit was a failure in POIContainerExtractionTest, due
      to already-fixed bug 51949 in POI (NPE on double-close of
      ZipFileZipEntrySource); I added a workaround in
      ParsingEmbeddedDocumentExtractor for this, with a TODO to remove the
      workaround once POI releases and we upgrade. It's important to remove
      that because we are double-opening the ZIP archive now for embedded
      OOXML docs...

      I also converted a couple if/else string equal chains into HashMap
      lookups.

      Show
      Michael McCandless added a comment - I noticed that when we parse an embedded Office document, it's inefficient because we take the NPOIFileSystem we had already parsed (from the full document) and write the "sub-directory" containing the embedded document to a temp file, only to re-parse it again once we've recursed to the inner detector/parser. I worked out a patch to instead just directly pass the sub-directory of the embedded document directly to the inner detector/parser. This gives a good speedup in my test case: I have a private test set of 2,080 Word docs; parsing them (and their embedded docs) takes 16.1 on trunk and 10.7 sec with this patch – 34% faster (best of 10). The change has a few parts: Fixed all Office parsers to alternatively directly take the document root (DirectoryNode); this was straightforward (but touched a lot of sources) because internally these parsers were extracting that root anyway. Fixed AbstractPOIFSExtractor to not do the serialization to a temp file and instead put the document's root on an otherwise empty (new byte [0] ) TikaInputStream as the openContainer. Fixed OfficeParser and POIFSContainerDetector to recognize a DirectoryNode on the incoming TikaInputStream, and parse/detect that directly. The one catch I hit was a failure in POIContainerExtractionTest, due to already-fixed bug 51949 in POI (NPE on double-close of ZipFileZipEntrySource); I added a workaround in ParsingEmbeddedDocumentExtractor for this, with a TODO to remove the workaround once POI releases and we upgrade. It's important to remove that because we are double-opening the ZIP archive now for embedded OOXML docs... I also converted a couple if/else string equal chains into HashMap lookups.
      Hide
      Michael McCandless added a comment -

      Patch.

      Show
      Michael McCandless added a comment - Patch.
      Hide
      Nick Burch added a comment -

      Patch looks fine to me

      I've added a static constructor on Ole10Native to create from a DirectoryNode, as well as POIFSFileSystem, which'll let that bit be tidied up later

      Show
      Nick Burch added a comment - Patch looks fine to me I've added a static constructor on Ole10Native to create from a DirectoryNode, as well as POIFSFileSystem, which'll let that bit be tidied up later
      Hide
      Michael McCandless added a comment -

      Awesome, thanks Nick!

      I'll add a TODO where we use Ole10Native to cutover to DirectoryNode once we upgrade POI. And then I guess leave this issue open after committing to remind us to go back and do these two TODOs...

      Show
      Michael McCandless added a comment - Awesome, thanks Nick! I'll add a TODO where we use Ole10Native to cutover to DirectoryNode once we upgrade POI. And then I guess leave this issue open after committing to remind us to go back and do these two TODOs...
      Hide
      Michael McCandless added a comment -

      OK I committed this; I'll leave it open so we remember to do the TODOs on next POI upgrade.

      Show
      Michael McCandless added a comment - OK I committed this; I'll leave it open so we remember to do the TODOs on next POI upgrade.
      Hide
      Michael McCandless added a comment -

      I opened TIKA-757 as the blanket issue for addressing TODOs on next POI upgrade so I think we can now resolve this one.

      Show
      Michael McCandless added a comment - I opened TIKA-757 as the blanket issue for addressing TODOs on next POI upgrade so I think we can now resolve this one.

        People

        • Assignee:
          Michael McCandless
          Reporter:
          Michael McCandless
        • Votes:
          0 Vote for this issue
          Watchers:
          0 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Development