Tika / TIKA-2588

Tika detecting/parsing pptx with embedded Excel worksheet(s)...

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.17
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: detector, parser
    • Labels:
      None

      Description

      Hello tika-developers,

      First, a big 'thank-you' for creating and maintaining Apache-Tika!  A really useful capability/service that can be used in so many different ways.  You folks are the true Debabelizer (h2g2.com).

      On to the issue encountered: using Tika 1.17 to extract an embedded Excel object from a pptx is causing problems.  A simple example is attached to this Jira issue (tikaSample.pptx); running it against Tika 1.17 (with verbose/list-parsers/list-detectors) produces the output in foo.out.  The deck contains a title slide and a single slide with an embedded Excel object on it.

      As noted to Nick Burch on Stack Overflow, I grabbed the unit-test data that you use in your parser/office JUnit suite (test_ppt_embedded_two_slides.pptx) and opened it in Office/PowerPoint 2016.  I selected the embedded sheet with the mouse (it had the Alfresco logo in it) and pasted it into an empty Office/Excel 2016 workbook.  To interact with it, I had to double-click it to make it active.  As a result, I ended up with two Excel instances on my Windows 10 desktop (the original object in one, the Excel worksheet in the other).  I have attached a picture of the embedded Excel object pasted into the workbook, followed by a picture of the worksheet opened inside the workbook (this required a double-click within the black-bordered area in the first picture).

      I managed to extract the embedded object using Apache POI.  The logic sequence was something like the following:

      1. Create an XMLSlideShow object, and pull the list of underlying slide entities.
      2. Walk the list of XSLFSlide(s), searching for a matching slide (by name) - e.g. 'MFL'.
      3. Examine the PackagePart(s) of the matching XSLFSlide and check each part's content type.
      4. If pPart.content-type is 'application/vnd.openxmlformats-officedocument.oleObject' then - 'candidate FOUND'.
      5. Build POIFS around the candidate FOUND, extract root of FileSystem.
      6. Verify that root has entries for { 'Package', '\u0001Ole', and '\u0001CompObj' }.
      7. Extract entry '\u0001CompObj', verify entry is a DocumentEntry and underlying bytes for DocumentNode match an 'Excel' signature.
      8. If (step 7 is true) -> extract entry 'Package'.
      9. The resulting entry represents the byte-stream of the embedded Excel entity.
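
      As a rough first-pass check before the POI route above, the container type of each embedded part can be inspected with the JDK alone.  This is only a sketch, not the POI logic itself and not how Tika does it.  It assumes the embedded parts live under 'ppt/embeddings/' (a common convention, not a guarantee), and the class name EmbeddedPartScan and its methods are hypothetical.  It reports whether each part begins with the OLE2 compound-file signature (the kind of container POIFS reads in step 5) or with a ZIP signature (an OOXML package such as an .xlsx stored directly).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists parts under ppt/embeddings/ and classifies each by magic bytes.
final class EmbeddedPartScan {

    // First 8 bytes of an OLE2 compound file.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // First 4 bytes of a ZIP local file header (OOXML packages are ZIP files).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // True if the first len bytes of data begin with the given signature.
    static boolean startsWith(byte[] data, int len, int[] magic) {
        if (len < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns "OLE2", "ZIP", or "UNKNOWN" for the leading bytes of a part.
    static String classify(byte[] header, int len) {
        if (startsWith(header, len, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, len, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    // Maps each entry name under ppt/embeddings/ (assumed location) to its container type.
    static Map<String, String> scan(InputStream pptx) throws IOException {
        Map<String, String> found = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(pptx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith("ppt/embeddings/")) {
                    continue;
                }
                // Read up to 8 leading bytes of the entry's content.
                byte[] header = new byte[8];
                int len = 0;
                int n;
                while (len < header.length
                        && (n = zip.read(header, len, header.length - len)) > 0) {
                    len += n;
                }
                found.put(name, classify(header, len));
            }
        }
        return found;
    }
}
```

      Calling EmbeddedPartScan.scan(new java.io.FileInputStream("tikaSample.pptx")) and printing the returned map would show which embedded parts are OLE2-wrapped and therefore need the 'Package' entry extracted as in steps 5 to 9.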

      I was able to instantiate this into a new workbook (as an example) using POI, and when I opened it, the worksheet was correctly embedded in that 'example.xlsx'.

      I am not as familiar with Tika, so I was less comfortable trying to walk through it.  I thought, however, that recreating this path would provide further insight for you.

        Attachments

        1. foo.out
          6 kB
          Brian McColgan
        2. pptEmbedExcelDoubleClickFromWorkbook.PNG
          133 kB
          Brian McColgan
        3. pptEmbedExcelInEmptyWorkbook.PNG
          65 kB
          Brian McColgan
        4. tikaSample.pptx
          43 kB
          Brian McColgan

              People

              • Assignee:
                tallison@apache.org Tim Allison
                Reporter:
                gto406 Brian McColgan
              • Votes:
                0
                Watchers:
                3
