Tika / TIKA-2588

Tika detecting/parsing pptx with embedded Excel worksheet(s)...

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.17
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: detector, parser
    • Labels: None

    Description

      Hello tika-developers,

      First, a big 'thank-you' for creating and maintaining Apache-Tika!  A really useful capability/service that can be used in so many different ways.  You folks are the true Debabelizer (h2g2.com).

      On to the issue I encountered: using Tika 1.17 to extract an embedded Excel object from a pptx is causing problems.  A simple example is attached to this Jira issue (tikaSample.pptx); running it against Tika 1.17 (with verbose/list-parsers/list-detectors) produces the output in foo.out.  The deck contains a title slide and a single slide with an embedded Excel object on it.

      As noted to gagravarr on Stack Overflow, I grabbed the unit-test data that you use in your parser/office JUnit suite (test_ppt_embedded_two_slides.pptx) and opened it in Office/PPT 2016.  I selected the embedded sheet with the mouse (it had the Alfresco logo in it) and pasted it into an empty Office/Excel 2016 workbook.  To interact with it, I had to double-click it to make it active.  As a result, I ended up with two Excel instances on my Windows 10 desktop (the original object in one, the Excel worksheet in the other).  I have attached a picture of the embedded Excel object pasted into the workbook (pptEmbedExcelInEmptyWorkbook.PNG), followed by a picture of the worksheet opened inside the workbook (pptEmbedExcelDoubleClickFromWorkbook.PNG), which required a double-click within the black-bordered area shown in the first picture.

      I managed to extract the embedded object using Apache POI.  The logic sequence was something like the following:

      1. Create an XMLSlideShow object and pull the list of underlying slide entities.
      2. Walk the list of XSLFSlide(s), searching for a matching slide (by name), e.g. 'MFL'.
      3. Examine the PackagePart of the matching XSLFSlide and check its content type.
      4. If pPart.content-type is 'application/vnd.openxmlformats-officedocument.oleObject', then 'candidate FOUND'.
      5. Build a POIFS around the candidate and extract the root of the file system.
      6. Verify that the root has entries for { 'Package', '\u0001Ole', and '\u0001CompObj' }.
      7. Extract the entry '\u0001CompObj'; verify that the entry is a DocumentEntry and that the underlying bytes of the DocumentNode match an 'Excel' signature.
      8. If step 7 is true, extract the entry 'Package'.
      9. The resulting entry represents the byte stream of the embedded Excel entity.
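
      As a rough, POI-free illustration of how candidates like those in steps 3-5 might be located, the sketch below treats the pptx as a plain zip package and lists every entry whose first eight bytes are the OLE2 compound-file signature (D0 CF 11 E0 A1 B1 1A E1).  This is not the code I ran and it does not cover steps 6-9; the class and method names (OleCandidateScan, findOle2Parts) are made up for this example, and it checks the byte signature instead of the declared content type.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: scans a pptx (a zip package) for OLE2 containers.
public class OleCandidateScan {

    // Header signature of an OLE2 compound file (the container POIFS reads).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns the names of all zip entries that start with the OLE2 signature.
    public static List<String> findOle2Parts(InputStream pptx) throws IOException {
        List<String> candidates = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Read up to 8 bytes from the start of this entry.
                byte[] head = new byte[OLE2_MAGIC.length];
                int filled = 0;
                int n;
                while (filled < head.length
                        && (n = zip.read(head, filled, head.length - filled)) > 0) {
                    filled += n;
                }
                if (filled == head.length && Arrays.equals(head, OLE2_MAGIC)) {
                    candidates.add(entry.getName());
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OleCandidateScan <file.pptx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            findOle2Parts(in).forEach(System.out::println);
        }
    }
}
```

      Each entry name it prints would then be a candidate for steps 5-9 (opening it with POIFS and looking for the 'Package' entry).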

      I was able to instantiate this into a new workbook (as an example) using POI, and when I opened it, the worksheet was correctly embedded in that 'example.xlsx'.

      I am not as familiar with Tika, so I was a little less comfortable trying to walk through it there.  I thought, however, that recreating this path would provide further insight for you.

      Attachments

        1. foo.out
          6 kB
          Brian McColgan
        2. pptEmbedExcelDoubleClickFromWorkbook.PNG
          133 kB
          Brian McColgan
        3. pptEmbedExcelInEmptyWorkbook.PNG
          65 kB
          Brian McColgan
        4. tikaSample.pptx
          43 kB
          Brian McColgan

            People

              Assignee: Tim Allison (tallison)
              Reporter: Brian McColgan (gto406)