Tika / TIKA-751

Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels: None

      Description

      I noticed some minor things in this method:

      • It does too much work (writes the tmpFile out) even when the
        EmbeddedDocumentExtractor doesn't actually want to parse the
        file.
      • It writes the tmpFile even in the OLE10_NATIVE case, where it is
        never used (because we use a TikaInputStream from the in-RAM byte[]
        instead).
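
      A minimal sketch of the ordering described in the two points above,
      using only the JDK. This is not the code from the patch: the class
      EmbeddedDocSketch, the Extractor interface, the inMemoryCase flag and
      the tempFilesWritten counter are all hypothetical stand-ins for
      illustration. The idea is to ask the extractor first, and to write a
      temp file only on the path that actually reads it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmbeddedDocSketch {

    /** Hypothetical stand-in for an extractor that may decline a document. */
    public interface Extractor {
        boolean shouldParse(String name);
        void parse(InputStream stream, String name) throws IOException;
    }

    /** Counts temp files written, so the effect of the guards is observable. */
    public static int tempFilesWritten = 0;

    public static void handleEmbedded(byte[] data, boolean inMemoryCase,
                                      String name, Extractor extractor)
            throws IOException {
        // Guard 1: ask the extractor before doing any work at all.
        if (!extractor.shouldParse(name)) {
            return;
        }
        // Guard 2: when the bytes are already in RAM, stream them directly
        // and never touch the file system.
        if (inMemoryCase) {
            try (InputStream in = new ByteArrayInputStream(data)) {
                extractor.parse(in, name);
            }
            return;
        }
        // Only this remaining path needs a temp file.
        Path tmp = Files.createTempFile("embedded-sketch", ".bin");
        try {
            Files.write(tmp, data);
            tempFilesWritten++;
            try (InputStream in = Files.newInputStream(tmp)) {
                extractor.parse(in, name);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

      With this ordering, a declining extractor or the in-memory case costs
      no disk I/O; only the remaining case pays for the temp file.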

      Also I fixed a typo in the method name (embeded -> embedded) – is
      that OK? It's a protected method, and a few of the office parsers
      invoke it.

      Finally, I cut over to TemporaryResources to track the possible tmpFile
      and open a TikaInputStream against it.
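
      To illustrate the general idea of tracking a possible temp file in one
      closeable holder, here is a small JDK-only sketch. TempFileTracker is a
      hypothetical class written for this example; it is not Tika's
      TemporaryResources and does not reproduce its API. It only shows the
      pattern: files are created lazily through the holder, and closing the
      holder deletes whatever was created.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder that remembers temp files and deletes them on close. */
public class TempFileTracker implements Closeable {

    private final List<Path> files = new ArrayList<>();

    /** Creates a temp file and remembers it for later cleanup. */
    public Path createTempFile() throws IOException {
        Path p = Files.createTempFile("tracker-sketch", ".tmp");
        files.add(p);
        return p;
    }

    /** Deletes every tracked file; rethrows the first failure, if any. */
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (Path p : files) {
            try {
                Files.deleteIfExists(p);
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                }
            }
        }
        files.clear();
        if (first != null) {
            throw first;
        }
    }
}
```

      Used in a try-with-resources block, the holder cleans up whether or not
      a temp file was ever requested, so the caller needs no separate
      "did we create a file?" bookkeeping.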

      Separately, it is currently inefficient that we must serialize a sub-dir
      (DirectoryEntry) in the NPOIFileSystem to a tmp file, only to re-parse
      it back into an NPOIFileSystem in OfficeParser. I'd like to look into
      (somehow) passing the NPOIFileSystem's DirectoryEntry directly to
      OfficeParser instead... but that looks like a bigger change.

        Attachments

        1. TIKA-751.patch
          8 kB
          Michael McCandless

            People

            • Assignee:
              mikemccand Michael McCandless
              Reporter:
              mikemccand Michael McCandless
