TIKA-791: Fix the detection of protected OOXML files


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1
    • Fix Version/s: 1.1
    • Component/s: mime
    • Labels: None
    • Environment: Windows 7 64 bit

    Description

      The TIKA-437 patch allowed Tika to work with OOXML files protected with the default VelvetSweatshop password. I feel there is room for improvement.

      1. The POIFSContainerDetector misreports the type when it sees such a file. It should be able to mark it as x-tika-ooxml.
      2. The OOXMLParser can't work with such a file. It should behave as follows:
        1. If the file is protected with the default password, it should be decrypted and processed normally.
        2. If the file is protected with a non-default password, it should be marked as protected, and no weird exceptions should appear.
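
      The two parser cases above can be sketched as a small decision function. This is only an illustration, not the actual OOXMLParser code: the class ProtectedOoxmlPolicy, the Outcome enum and the passwordMatches callback are hypothetical names, and the callback stands in for whatever password check the decryption library provides.

```java
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from Tika or POI.
final class ProtectedOoxmlPolicy {
    // The default password mentioned in this issue.
    static final String DEFAULT_PASSWORD = "VelvetSweatshop";

    enum Outcome { DECRYPT_AND_PARSE, REPORT_PROTECTED }

    // passwordMatches is an assumed callback: it returns true when the
    // given password unlocks the encrypted package.
    static Outcome decide(Predicate<String> passwordMatches) {
        if (passwordMatches.test(DEFAULT_PASSWORD)) {
            // Case 1: default password, so decrypt and process normally.
            return Outcome.DECRYPT_AND_PARSE;
        }
        // Case 2: some other password, so report the file as protected
        // instead of letting a low-level exception escape.
        return Outcome.REPORT_PROTECTED;
    }

    public static void main(String[] args) {
        // Simulate a file whose real password is the default one.
        System.out.println(decide(DEFAULT_PASSWORD::equals));
        // Simulate a file with a user-chosen password.
        System.out.println(decide("secret"::equals));
    }
}
```

      The point of the design is that the caller only ever sees one of two well-defined outcomes, which is what "no weird exceptions" asks for.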

      Therefore I'd like to add an 'if' to POIFSContainerDetector that returns x-tika-ooxml, and some code to OOXMLParser similar to the code currently residing in OfficeParser. After this improvement, both the OfficeParser and the OOXMLParser will treat such files in the same way.
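
      A minimal sketch of what the extra 'if' could look like, assuming the detector can see the names of the top-level entries of the OLE2 container. The class ContainerTypeSketch and its detect method are hypothetical and do not mirror the real POIFSContainerDetector signature. The entry names EncryptionInfo and EncryptedPackage are the streams an encrypted OOXML package is stored in; the fallback type application/x-tika-msoffice is used here only as an assumed generic result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the real POIFSContainerDetector.
final class ContainerTypeSketch {
    static final String OLE2 = "application/x-tika-msoffice";
    static final String OOXML = "application/x-tika-ooxml";

    // rootEntryNames: names of the top-level entries in the OLE2 container.
    static String detect(Set<String> rootEntryNames) {
        // The proposed extra 'if': an encrypted OOXML package shows up as
        // an OLE2 container holding these two streams.
        if (rootEntryNames.contains("EncryptionInfo")
                && rootEntryNames.contains("EncryptedPackage")) {
            return OOXML;
        }
        // Assumed fallback: everything else stays a generic OLE2 document.
        return OLE2;
    }

    public static void main(String[] args) {
        Set<String> encrypted = new HashSet<>(
                Arrays.asList("EncryptionInfo", "EncryptedPackage"));
        Set<String> plainDoc = new HashSet<>(
                Arrays.asList("WordDocument", "1Table"));
        System.out.println(detect(encrypted));
        System.out.println(detect(plainDoc));
    }
}
```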

      Once that is in place, I can add a hack in my application that says: "If the detected type is x-tika-ooxml and the name-based detection is a specialization of ooxml, then use the name-based detection". This works around the fact that in MimeTypes, magic always trumps the name. With that, encrypted DOCX files will appear with the normal DOCX mimetype in my app.
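
      The application-side hack could look roughly like this. It is a self-contained sketch that uses plain strings and a hand-built supertype table instead of Tika's own media type classes; the class NameOverridesMagicSketch, the SUPERTYPE map and the resolve method are all hypothetical, and the table only lists two OOXML subtypes for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the application-side workaround.
final class NameOverridesMagicSketch {
    static final String OOXML = "application/x-tika-ooxml";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Assumed, hand-built child -> parent table (illustration only).
    private static final Map<String, String> SUPERTYPE = new HashMap<>();
    static {
        SUPERTYPE.put(DOCX, OOXML);
        SUPERTYPE.put(XLSX, OOXML);
    }

    // True when 'type' has 'ancestor' somewhere up its supertype chain.
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = SUPERTYPE.get(type); t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Prefer the name-based type only when magic says generic OOXML and
    // the name-based type is a more specific kind of OOXML.
    static String resolve(String magicType, String nameType) {
        if (OOXML.equals(magicType)
                && nameType != null
                && isSpecializationOf(nameType, OOXML)) {
            return nameType;
        }
        return magicType;
    }

    public static void main(String[] args) {
        System.out.println(resolve(OOXML, DOCX));
        System.out.println(resolve(OOXML, "text/plain"));
    }
}
```

      In every other case the magic-based result is returned unchanged, so the hack only affects files that were detected as generic x-tika-ooxml.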

      Attachments

        1. tika-791.zip
          117 kB
          Antoni Mylka
        2. tika-791-ver2.zip
          115 kB
          Antoni Mylka

          People

            Assignee: Unassigned
            Reporter: Antoni Mylka