Tika: TIKA-705

Valid OOXML PPT file hits InvalidFormatException thrown in POI

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels: None

      Description

      I took the "testRTFVarious.rtf" test case from TIKA-683, and saved it as various other doc types, to generate more test cases.

      But when I did this for PPTX, the resulting file hits this exception:

      Exception in thread "main" org.apache.tika.exception.TikaException: Broken OOXML file
      	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:141)
      	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:112)
      	at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:95)
      	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
      	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
      	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:363)
      	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
      Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A segment shall not hold any characters other than pchar characters. [M1.6]
      	at org.apache.poi.openxml4j.opc.PackagePartName.checkPCharCompliance(PackagePartName.java:370)
      	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfPartNameHaveInvalidSegments(PackagePartName.java:270)
      	at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:185)
      	at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:83)
      	at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:490)
      	at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEmbeddedParts(AbstractOOXMLExtractor.java:124)
      	... 9 more
      

      All I did was open Office 2007, copy/paste over the text from the Word doc, and save it. I.e., it should be a valid OOXML file, unless Office 2007 is buggy?

      1. testPPT_various.pptx (47 kB, Michael McCandless)

        Activity

        Michael McCandless added a comment -

        PPTX file showing the exception.

        Nick Burch added a comment -

        Looks to be a problem with a reference to part of a slide, rather than a whole slide:
        /ppt/slides/slide1.xml#_ftn1
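
To illustrate why such a target could trip the "pchar" check in the stack trace, here is a minimal sketch of the RFC 3986 pchar rule applied to one path segment. This is not POI's implementation; the class name `PCharCheck` and the method `firstNonPChar` are hypothetical and exist only for this example.

```java
final class PCharCheck {
    // RFC 3986: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    private static final String UNRESERVED_PUNCT = "-._~";
    private static final String SUB_DELIMS = "!$&'()*+,;=";

    private static boolean isAlphaNum(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f');
    }

    /** Returns the index of the first character that is not a pchar, or -1 if the segment is clean. */
    static int firstNonPChar(String segment) {
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c == '%') {
                // pct-encoded: '%' must be followed by exactly two hex digits
                if (i + 2 < segment.length()
                        && isHex(segment.charAt(i + 1))
                        && isHex(segment.charAt(i + 2))) {
                    i += 2;
                    continue;
                }
                return i;
            }
            boolean ok = isAlphaNum(c)
                    || UNRESERVED_PUNCT.indexOf(c) >= 0
                    || SUB_DELIMS.indexOf(c) >= 0
                    || c == ':' || c == '@';
            if (!ok) {
                return i;
            }
        }
        return -1;
    }
}
```

With this sketch, `PCharCheck.firstNonPChar("slide1.xml#_ftn1")` returns 10, the position of the `#`, while `PCharCheck.firstNonPChar("slide1.xml")` returns -1. That would be consistent with the anchor being what the part name validation rejects.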

        Michael McCandless added a comment -

        Thanks for looking at this Nick!

        So, is this something I somehow screwed up using PowerPoint 2007? Or is PowerPoint 2007 simply producing an invalid OOXML file?

        Is there anything we (or POI) can do here? It's bad if users can produce things "normally" (i.e. just using PowerPoint) which Tika then chokes on...

        Nick Burch added a comment -

        I'll need to read the spec to be sure, but I have a feeling it could be our issue with not removing anchors before fetching parts.

        Either way, we probably want to make it easier for people to get related parts, as the current method is a bit more fiddly than we really want.

        This will probably largely all be done on the POI side though, with the only Tika bit being the move to the new, simpler code once it is available.
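
As a rough illustration of "removing anchors before fetching parts", the sketch below drops everything from the first `#` onward in a relationship target before it would be used as a part name. It is an assumption about the general shape of such a workaround, not the code committed to Tika or POI; `AnchorStripper` and `stripFragment` are hypothetical names.

```java
final class AnchorStripper {
    /**
     * Returns the target without its fragment ("#anchor"), or the target
     * unchanged when it has none. Operates on the raw string, so any
     * percent-encoding in the path is left untouched.
     */
    static String stripFragment(String target) {
        int hash = target.indexOf('#');
        return hash < 0 ? target : target.substring(0, hash);
    }
}
```

For the target from this issue, `AnchorStripper.stripFragment("/ppt/slides/slide1.xml#_ftn1")` yields `/ppt/slides/slide1.xml`, which names the whole slide part. Working on the raw string avoids decoding and re-encoding the path, at the cost of not validating the rest of the URI.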

        Jukka Zitting added a comment -

        Removing from the 0.10 roadmap, let's set the fix version to the next release once the fix is in.

        Nick Burch added a comment -

        Initial workaround committed in r1172690.

        The proper fix is commented out in the code, and can be activated when we upgrade to POI 3.8 beta 5 (I've added a new method there).

        Michael McCandless added a comment -

        Thanks Nick!

        I verified that the testVarious test case (in OOXMLParserTest) now passes (I had left it commented out), so I'll go uncomment & commit.

        Michael McCandless added a comment -

        I think we can resolve this now? TIKA-757 is open to address TODOs on next POI upgrade.

        Nick Burch added a comment -

        Code simplified in r1221115 now that we've upgraded POI


          People

          • Assignee: Unassigned
          • Reporter: Michael McCandless
          • Votes: 1
          • Watchers: 0
