TIKA-801

ContentHandlerDecorator outputs invalid element

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0, 1.1
    • Fix Version/s: 1.1
    • Component/s: None
    • Labels:
      None

      Description

      • Start Tika GUI
      • try opening test-outlook.msg (from tika-parsers test resources)
      • the following exception is thrown:
        org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@12e14ebc
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:245)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
        	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        	at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
        	at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
        	at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
        	at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2028)
        	at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2351)
        	at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
        	at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
        	at javax.swing.AbstractButton.doClick(AbstractButton.java:389)
        	at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
        	at com.apple.laf.AquaMenuItemUI.doClick(AquaMenuItemUI.java:137)
        	at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
        	at java.awt.Component.processMouseEvent(Component.java:6373)
        	at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
        	at java.awt.Component.processEvent(Component.java:6138)
        	at java.awt.Container.processEvent(Container.java:2085)
        	at java.awt.Component.dispatchEventImpl(Component.java:4735)
        	at java.awt.Container.dispatchEventImpl(Container.java:2143)
        	at java.awt.Component.dispatchEvent(Component.java:4565)
        	at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4621)
        	at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4282)
        	at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4212)
        	at java.awt.Container.dispatchEventImpl(Container.java:2129)
        	at java.awt.Window.dispatchEventImpl(Window.java:2478)
        	at java.awt.Component.dispatchEvent(Component.java:4565)
        	at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:679)
        	at java.awt.EventQueue.access$000(EventQueue.java:85)
        	at java.awt.EventQueue$1.run(EventQueue.java:638)
        	at java.awt.EventQueue$1.run(EventQueue.java:636)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
        	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
        	at java.awt.EventQueue$2.run(EventQueue.java:652)
        	at java.awt.EventQueue$2.run(EventQueue.java:650)
        	at java.security.AccessController.doPrivileged(Native Method)
        	at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
        	at java.awt.EventQueue.dispatchEvent(EventQueue.java:649)
        	at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:296)
        	at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:211)
        	at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:201)
        	at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:196)
        	at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:188)
        	at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
        Caused by: java.lang.NullPointerException
        	at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
        	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
        	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        	at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
        	at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
        	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        	at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
        	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        	at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
        	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
        	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:159)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:243)
        	... 44 more
        

      The same file is parsed without any errors when not in GUI mode.

      1. TIKA-801.patch
        5 kB
        Michael McCandless
      2. FW Testing.msg
        25 kB
        Paul Hill

        Activity

        Michael McCandless added a comment -

        This is happening because the OfficeParser is producing a 2nd body endElement (ie </body>) without a matching body (<body>) startElement.

        TIKA-715 would have caught this earlier... when I enable the asserts from there, and run TikaCLI to extract text from this doc, indeed I hit:

        Exception in thread "main" java.lang.AssertionError: end tag=body with no startElement
        	at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:219)
        	at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:270)
        	at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
        	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:159)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
        	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
        	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
        

        (There's also a 2nd mismatched </html> tag).

        But... I don't know why OfficeParser is producing a mismatched </body></html> for this document!

        Maybe, it's invoking a sub-parser but failing to wrap the ContentHandler with EndDocumentShieldingContentHandler? (OpenDocumentParser uses EndDocumentShieldingContentHandler for this same reason...).
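
        To illustrate the mismatch described above: a SAX event stream is only well formed if every endElement closes the most recently opened element. The following is a minimal sketch using plain JDK SAX classes, not the TIKA-715 assertion code. The class name BalanceCheckingHandler is hypothetical, and the error message merely imitates the one in the trace.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker (not part of Tika): rejects an end tag that has no matching open start tag. */
public class BalanceCheckingHandler extends DefaultHandler {
    // Names of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // A second </body> arrives when "body" is no longer the innermost open element.
        if (open.isEmpty() || !open.peek().equals(qName)) {
            throw new SAXException("end tag=" + qName + " with no startElement");
        }
        open.pop();
    }
}
```

        Feeding this handler one startElement for body followed by two endElement calls for body makes the second endElement throw, which is the same situation the assertion above reports.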

        Jukka Zitting added a comment -

        EndDocumentShieldingContentHandler

        IMHO we shouldn't be using the EDSCH mechanism. As noted by Nick in TIKA-646, the correct fix for cases like this would be to update the parsers to generate the metadata before they call endDocument. The EDSCH solution only fixes the symptoms but not the root cause of the problem.

        Paul Hill added a comment -

        If I am having the same problem, as Mike McCandless suggested on the user list, then it is easy to reproduce. No attachments are required: just forward an e-mail 2 or 3 times to yourself within Outlook, then copy and paste it onto your filesystem to create an .msg file. My 1st example was from last year or older, so the latest Outlook is NOT required.

        Then drop the file onto Tika-app 1.0 (0.7, 0.9 and 0.10 do not show the problem) and you get the following:

        org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@97de276
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
        at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
        at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
        at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
        [...]
        Caused by: java.lang.NullPointerException
        at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(Unknown Source)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
        at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
        at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:178)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 41 more

        Paul Hill added a comment - - edited

        The attached file produces this stack trace in Tika 1.0, specifically the Tika-app-1.0 release as of 2011-12-05 (not the nightly build).

        Michael McCandless added a comment -

        Actually this isn't a problem of a parser outputting metadata after
        startDocument...

        The problem, for both of the test docs, is that the Outlook message
        has a chunk of RTF text and so OutlookExtractor recurses into the
        RTFParser.

        RTFParser then calls start/endDocument itself.

        I can fix this by having RTFParser expose a separate parse method,
        with control over whether or not it should call start/endDocument
        itself; that seems to fix these two test docs.

        However, if the Outlook message has an HTML chunk, it's also broken:
        try running TikaGUI on
        tika-parsers/src/test/resources/test-documents/testMSG_chinese.msg
        (that's an HTML Outlook message).

        How can/should we fix that one? It's tagsoup that's calling
        .endDocument...

        Jukka Zitting added a comment -

        See the org.apache.tika.sax.EmbeddedContentHandler class. It's explicitly designed for cases like this.
        The ParsingEmbeddedDocumentExtractor class has an example of how to use EmbeddedContentHandler.
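
        The idea behind such a decorator can be sketched as follows. This is an illustration in plain JDK SAX under stated assumptions, not Tika's EmbeddedContentHandler source: the class name DocumentEventShield is hypothetical, and only a few event types are forwarded to keep it short.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical decorator (not Tika code): hides a nested parser's document events from the outer handler. */
public class DocumentEventShield extends DefaultHandler {
    private final ContentHandler outer;

    public DocumentEventShield(ContentHandler outer) {
        this.outer = outer;
    }

    @Override
    public void startDocument() {
        // Swallowed: the outer parser has already started the document.
    }

    @Override
    public void endDocument() {
        // Swallowed: only the outer parser may end the document.
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        outer.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        outer.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        outer.characters(ch, start, length);
    }
}
```

        A parent parser would hand new DocumentEventShield(handler) to the nested parser instead of its own handler. The sketch covers only the document events; filtering out the nested parser's own html and body elements is a separate concern that it does not address.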

        Michael McCandless added a comment -

        See the org.apache.tika.sax.EmbeddedContentHandler class.

        Excellent!

        I did that (patch attached) and these RTF/HTML Outlook docs are now fine through TikaGUI.

        Paul Hill added a comment -

        Thanks Jukka and Michael. Your quick response is appreciated.

        Jukka Zitting added a comment -

        patch attached

        Looks good, +1.


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Andrzej Bialecki