Tika / TIKA-2311

Preserve "x-tika-ooxml" mime value for truncated ooxml files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels:
      None

      Description

      The following is an unintended consequence of TIKA-2212.

      The OOXML parser used to handle x-tika-ooxml. We have some truncated ooxml files in our regression corpus. The previous behavior was:

      1) ZipPackage detector caught the zip truncation exception and returned "application/zip"
      2) The mime detector recognized magic and returned x-tika-ooxml
      3) The file was then routed to the OOXML parser which didn't wind up doing much with the content because it hit the zip exception early on, but the final mime type was x-tika-ooxml.

      The current behavior:
      1) Same detection steps
      2) However, because the OOXML parser no longer handles x-tika-ooxml, the file is handled by the Package Parser, which overwrites the magic-determined mime type, and the new mime type is application/zip.
      3) Some content is extracted because the Package parser handles the zip entries in order and only throws the exception once it hits the last entry in the zip file.
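The difference in step 3 comes down to how the archive is read. A streaming reader walks the entries in order and only fails when it reaches the damaged one, while a reader that needs the central directory at the end of the file cannot open a truncated archive at all. The following is a minimal sketch of that effect using only the JDK's `java.util.zip` as a stand-in; it does not use Tika's PackageParser, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; JDK zip classes stand in for the parsers discussed in the issue.
final class TruncatedZipSketch {

    /** Builds a two-entry zip in memory and cuts it off inside the second entry's data. */
    static byte[] buildTruncatedZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int secondEntryOffset;
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            secondEntryOffset = bytes.size(); // the second local header starts here

            byte[] noise = new byte[10_000];
            new Random(42).nextBytes(noise);  // incompressible payload
            zip.putNextEntry(new ZipEntry("b.bin"));
            zip.write(noise);
            zip.closeEntry();
        }
        // Keep the second header plus a few data bytes; drop the rest and the central directory.
        return Arrays.copyOf(bytes.toByteArray(), secondEntryOffset + 100);
    }

    /** Names of the entries a streaming reader can read completely before it fails. */
    static List<String> readableEntries(byte[] zipBytes) {
        List<String> done = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[4096];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                while (in.read(buf) != -1) {
                    // drain the entry
                }
                done.add(entry.getName());
            }
        } catch (IOException truncated) {
            // Reached the damaged entry; everything read before it is kept.
        }
        return done;
    }

    /** True if java.util.zip.ZipFile, which needs the central directory, can open the bytes. */
    static boolean opensAsZipFile(byte[] zipBytes) throws IOException {
        Path tmp = Files.createTempFile("truncated", ".zip");
        try {
            Files.write(tmp, zipBytes);
            try (ZipFile zf = new ZipFile(tmp.toFile())) {
                return true;
            } catch (IOException e) {
                return false;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] truncated = buildTruncatedZip();
        System.out.println("streamed entries: " + readableEntries(truncated));
        System.out.println("opens as ZipFile: " + opensAsZipFile(truncated));
    }
}
```

The sketch only shows why a streaming parser can extract partial content from a truncated archive; the choice of final mime type is a separate decision.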

      Ideally, I'd like to keep the magic-determined mime detection. Once we can chain parsers, the user should be able to back off to the PackageParser, but I don't think this should be the default behavior.

      One solution would be to create a new mime type that is not the parent of the other ooxml subtypes, but is itself a leaf subtype, something like: x-tika-ooxml-unk.

      Any objections/other recommendations?

          Activity

          hudson Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build tika-2.x-windows #203 (See https://builds.apache.org/job/tika-2.x-windows/203/)
          TIKA-2311 – try OPC before ZipFile. This can work better on some (tallison: rev 6930ff0251e9e93ee969a9f1287c902d31045b59)

          • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/opc/OPCDetector.java
          hudson Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build tika-2.x #250 (See https://builds.apache.org/job/tika-2.x/250/)
          TIKA-2311 – try OPC before ZipFile. This can work better on some (tallison: rev 6930ff0251e9e93ee969a9f1287c902d31045b59)

          • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/opc/OPCDetector.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1253 (See https://builds.apache.org/job/Tika-trunk/1253/)
          TIKA-2311 – to handle truncated files more robustly, in (tallison: https://github.com/apache/tika/commit/0b37895ef0eae560444e3078b7fa33f6fec11eca)

          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java
          tallison@mitre.org Tim Allison added a comment -

          Will reopen if that didn't fix the problem...

          tallison@mitre.org Tim Allison added a comment -

          Example xlsm file that cannot be opened by MSWord, but which can be parsed without an exception by Tika 1.14.

          tallison@mitre.org Tim Allison added a comment -

          Regression corpus shows the initial fix is not enough.

          There are some truncated files in our regression corpus that can be opened by OPCPackage but cannot be opened by ZipFile. These files can be fully parsed by the OOXMLParser without an exception – the truncated embedded file is never required for parsing.

          If we flip the order in detectZipFormat to open an OPCPackage and test for ooxml before opening a ZipFile, we get the same behavior as in 1.14 on our unit tests and on the handful of example files from our regression set. We'll need to confirm that this doesn't lead to any unforeseen behavior in other files in the regression set.
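The effect of flipping the order can be modeled with a toy example. This is only a sketch under stated assumptions: `Sample`, `probeOpc`, `zipFileFirst`, and `opcFirst` are hypothetical names, the boolean flags stand in for whether each opener tolerates the file, and none of this is the actual detectZipFormat code.

```java
import java.util.Optional;

// Hypothetical model of the probe order; not the real detector.
final class DetectionOrderSketch {
    static final String ZIP = "application/zip";
    static final String OOXML = "application/x-tika-ooxml";

    /** Stand-in for a file under test; the flags say which openers tolerate it. */
    static final class Sample {
        final boolean opensAsZipFile;
        final boolean opensAsOpcPackage;

        Sample(boolean opensAsZipFile, boolean opensAsOpcPackage) {
            this.opensAsZipFile = opensAsZipFile;
            this.opensAsOpcPackage = opensAsOpcPackage;
        }
    }

    /** Stand-in for the OPC probe: a type if the package opens, empty otherwise. */
    static Optional<String> probeOpc(Sample s) {
        return s.opensAsOpcPackage ? Optional.of(OOXML) : Optional.empty();
    }

    /** ZipFile first: a failure to open ends detection with the generic zip type. */
    static String zipFileFirst(Sample s) {
        if (!s.opensAsZipFile) {
            return ZIP;
        }
        return probeOpc(s).orElse(ZIP);
    }

    /** OPC first: fall back to the generic zip type only when the OPC probe has no answer. */
    static String opcFirst(Sample s) {
        return probeOpc(s).orElse(ZIP);
    }

    public static void main(String[] args) {
        // A truncated file that OPCPackage tolerates but ZipFile does not.
        Sample truncated = new Sample(false, true);
        System.out.println("ZipFile first: " + zipFileFirst(truncated));
        System.out.println("OPC first:     " + opcFirst(truncated));
    }
}
```

In this model the two orders agree on intact files and differ only for files that open as an OPC package but not as a ZipFile, which is the case described above.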

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1239 (See https://builds.apache.org/job/Tika-trunk/1239/)
          TIKA-2311 – maintain mime information for truncated ooxml (tallison: https://github.com/apache/tika/commit/3aab15f8f277614e3c5783c4862e25d63b737425)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (add) tika-parsers/src/test/resources/test-documents/testWORD_truncated.docx
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pkg/TarParserTest.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #243 (See https://builds.apache.org/job/tika-2.x/243/)
          TIKA-2311 – maintain x-tika-ooxml mime type for truncated ooxml (tallison: rev 143efc8d92735099f5077956d8f257aad106321a)

          • (edit) tika-app/src/test/java/org/apache/tika/parser/pkg/PackageTest.java
          • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_truncated.docx
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          • (edit) tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/TarParserTest.java
          hudson Hudson added a comment -

          UNSTABLE: Integrated in Jenkins build tika-2.x-windows #197 (See https://builds.apache.org/job/tika-2.x-windows/197/)
          TIKA-2311 – maintain x-tika-ooxml mime type for truncated ooxml (tallison: rev 143efc8d92735099f5077956d8f257aad106321a)

          • (edit) tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/TarParserTest.java
          • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
          • (add) tika-test-resources/src/test/resources/test-documents/testWORD_truncated.docx
          • (edit) tika-app/src/test/java/org/apache/tika/parser/pkg/PackageTest.java
          • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParserTest.java
          tallison@mitre.org Tim Allison added a comment -

          Thank you for the feedback, Nick Burch and Luis Filipe Nassif!

          I kept our distinction between posix and gnu tar, if I understand it correctly, by modifying the unit test.

          tallison@mitre.org Tim Allison added a comment -

          Ha, NiFi overrides our definition and just calls it .tar: here

          Should we do the same?

          tallison@mitre.org Tim Allison added a comment -

          When I use a static MediaTypeRegistry in PackageParser, the new unit test passes on a truncated docx; however, test-documents.tar is now detected as "x-gtar".

          Looking at the definition:

            <mime-type type="application/x-gtar">
              <_comment>GNU tar Compressed File Archive (GNU Tape Archive)</_comment>
              <magic priority="50">
                <!-- GNU tar archive -->
                <match value="ustar  \0" type="string" offset="257" />
              </magic>
              <glob pattern="*.gtar"/>
              <sub-class-of type="application/x-tar"/>
            </mime-type>
          

          Is this really the mime type for ustar rather than gtar?

          tallison@mitre.org Tim Allison added a comment -

          Yes, this is far preferable.

          Is there a lightweight way to grab the MediaTypeRegistry in order to do the checking for "child of zip"? Create a singleton in PackageParser? Or is there a better way?

          gagravarr Nick Burch added a comment -

          How about we have package parser say "if no mimetype set or current mimetype is not based on zip, set to zip. If current mimetype is a child of zip, leave unchanged" ?
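The proposed rule can be sketched as follows. This is an illustration only: the `PARENT` map is a hand-written stand-in for a real media type registry, and `isZipOrChild` and `resolve` are hypothetical helper names, not methods of PackageParser.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "leave children of zip unchanged" rule.
final class ContentTypeRuleSketch {
    static final String ZIP = "application/zip";

    // Assumed parent links for illustration; a real registry would supply these.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("application/x-tika-ooxml", ZIP);
        PARENT.put("application/x-gtar", "application/x-tar");
    }

    /** Walks the parent chain and reports whether the type is zip or derives from zip. */
    static boolean isZipOrChild(String type) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            if (ZIP.equals(t)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the content type the package parser should report under the proposed rule. */
    static String resolve(String current) {
        if (current != null && isZipOrChild(current)) {
            return current; // already zip or a child of zip: leave unchanged
        }
        return ZIP;         // unset, or not based on zip: set to zip
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("application/x-tika-ooxml"));
        System.out.println(resolve("text/plain"));
    }
}
```

With this rule a magic-detected x-tika-ooxml value survives the package parser, while an unset or unrelated type is still replaced by application/zip.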

          tallison@mitre.org Tim Allison added a comment -

          I thought about that. I was uncomfortable with the tight coupling of checking in the PackageParser whether "x-tika-ooxml" had already been set. Or should we have a more general check: if Content-Type has already been set, do not overwrite it?

          I prefer this option to my original proposal.

          lfcnassif Luis Filipe Nassif added a comment -

          Hi Tim,

          What about removing the mimetype overwrite logic from PackageParser? That way, truncated ooxml files would continue to be x-tika-ooxml and would be handled by PackageParser or some future parser for generic ooxml.


            People

            • Assignee: Unassigned
            • Reporter: Tim Allison (tallison@mitre.org)
            • Votes: 0
            • Watchers: 4