Tika / TIKA-1761

Error Parsing PPT (97-2003) files with password protection against modification which were created using Office 2013

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.7, 1.10
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      PPT documents created (or saved) in PowerPoint 97-2003 format and password-protected against modification using Office 2013 fail during text extraction.
      Files in the same PowerPoint 97-2003 format that were protected using Office 2007 are parsed without errors.

      java -jar tika-app-1.10.jar --text test_2003.ppt
      Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@22b0f5af
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
              at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:185)
              at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
              at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
      Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
              at org.apache.poi.hslf.EncryptedSlideShow.<init>(EncryptedSlideShow.java:102)
              at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:259)
              at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:250)
              at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:165)
              at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
              at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
              at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              ... 5 more
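
      The trace above shows the POI exception wrapped inside a TikaException. As a minimal sketch (not part of Tika or POI), a caller could tell this failure apart from other parse errors by walking the cause chain. The class name EncryptedPptCheck and its methods are hypothetical; the exception class name is copied from the trace and compared as a string so the sketch compiles without POI on the classpath.

```java
// Hypothetical helper: not a Tika or POI API.
final class EncryptedPptCheck {
    // Class name taken verbatim from the "Caused by" line of the stack trace.
    static final String ENCRYPTED_PPT =
            "org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException";

    /** Returns true if any throwable in the cause chain has exactly this class name. */
    static boolean hasCause(Throwable t, String className) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            if (t.getClass().getName().equals(className)) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /** Convenience check for the encrypted-PowerPoint failure reported here. */
    static boolean isEncryptedPpt(Throwable t) {
        return hasCause(t, ENCRYPTED_PPT);
    }
}
```

      In practice the argument would be the TikaException caught around the parse call. The comparison is by exact class name, so subclasses of the POI exception would not match.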
      

      I debugged the Tika library and found that the failure is tied to the UserEditAtom.encryptSessionPersistIdRef property. This property is empty in files created with Office 2007 and non-empty in files created with Office 2013.
      I defragmented the PPT files as described in https://social.msdn.microsoft.com/Forums/en-US/e33189a5-0b00-44b7-b084-f2757e9b7536/powerpoint-binary-file-format-decryption?forum=os_binaryfile
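
      To compare the two attachments on this property, the raw UserEditAtom record can be inspected directly. The following is a rough sketch under stated assumptions: the record type 0x0FF5, the 8-byte record header, and the body lengths 0x1C (field absent) and 0x20 (field present) reflect my reading of the MS-PPT specification, not anything confirmed in this issue, and "empty" is interpreted here as "field absent". The class UserEditAtomProbe is hypothetical and is not a POI API; locating the record inside the "PowerPoint Document" stream is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical probe: not a POI API. Layout constants are assumptions from MS-PPT.
final class UserEditAtomProbe {
    static final int RT_USER_EDIT_ATOM = 0x0FF5;     // assumed record type
    static final int HEADER_LEN = 8;                 // recVer/recInstance, recType, recLen
    static final int LEN_WITHOUT_ENCRYPT_REF = 0x1C; // assumed body length, field absent
    static final int LEN_WITH_ENCRYPT_REF = 0x20;    // assumed body length, field present

    /**
     * Reads encryptSessionPersistIdRef from a raw UserEditAtom record
     * (header included). Returns -1 if the record does not carry the field.
     */
    static long encryptSessionPersistIdRef(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < HEADER_LEN) {
            throw new IllegalArgumentException("record header truncated");
        }
        buf.getShort();                            // recVer + recInstance, unused here
        int recType = buf.getShort() & 0xFFFF;     // unsigned 16-bit record type
        long recLen = buf.getInt() & 0xFFFFFFFFL;  // unsigned 32-bit body length
        if (recType != RT_USER_EDIT_ATOM) {
            throw new IllegalArgumentException("not a UserEditAtom record");
        }
        if (recLen < LEN_WITH_ENCRYPT_REF) {
            return -1;                             // optional trailing field is absent
        }
        if (buf.remaining() < LEN_WITH_ENCRYPT_REF) {
            throw new IllegalArgumentException("record body truncated");
        }
        // Skip the 28 bytes of fixed fields; the optional field follows them.
        buf.position(HEADER_LEN + LEN_WITHOUT_ENCRYPT_REF);
        return buf.getInt() & 0xFFFFFFFFL;
    }
}
```

      If these assumptions hold, a record from test-2007.ppt would be expected to yield -1 and one from test-2013.ppt a persist object identifier, which would match the difference described above.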

      Is this a bug in Tika or in the POI library?
      Should this case be supported, given Apache POI's encryption support?

      Attachments

        1. test-2007.ppt
          98 kB
          Andriy Budzinskyy
        2. test-2013.ppt
          419 kB
          Andriy Budzinskyy

            People

              Assignee: Tim Allison (tallison)
              Reporter: Andriy Budzinskyy (andriy.budzinskyy)
              Votes: 0
              Watchers: 2
