Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: detector, parser
    • Labels: None
    • Environment: Eclipse, Java 8

      Description

      I ran into a couple of problems while parsing and extracting attachments from .eml files from Thunderbird. Some files are wrongly identified (as text/html or application/xhtml+xml), and in many of them the attachments are not detected. I tried to parse 20 random .eml files with attachments (pdf, txt, html, etc.), and at least 10 of them are either identified as HTML, or correctly identified as rfc822 but without their attachments being extracted. I tried the same files using the TikaCLI -z option, with the same result.

      What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract and store the attachments somewhere else (exactly as shown in this example code https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).

      1. CameraCalibration.eml
        77 kB
        Eli Trucco
      2. Exkursion.eml
        690 kB
        Eli Trucco

        Activity

        eli.trucco Eli Trucco added a comment -

        I attached two example files. Exkursion is always detected as text/html and CameraCalibration is correctly identified as rfc822 but the attachments are not extracted.

        tallison@mitre.org Tim Allison added a comment -

        Thank you for opening this. I'll take a look.

        eli.trucco Eli Trucco added a comment -

        Thanks, Tim! Another thing I noticed is, in the example code ExtractEmbeddedFiles.java (link above), the input stream should first be wrapped in TikaInputStream, otherwise RFC822 Parser will throw an exception because it doesn't support mark/reset.

        gagravarr Nick Burch added a comment -

        I've just tried with a 1.14 snapshot build, and both files are detected as message/rfc822 there, so it looks like we may have already fixed at least part of this on trunk.

        Attachment extraction is failing with a different error, from within Tika itself so likely our own fault:

        Exception in thread "main" org.apache.tika.exception.TikaException: Failed to parse an email message
        	at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:81)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:191)
        	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:480)
        	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
        Caused by: java.io.IOException: mark/reset not supported
        	at java.io.InputStream.reset(InputStream.java:347)
        	at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:436)
        	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
        	at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1021)
        	at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:182)
        	at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
        	at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:78)
        	... 6 more
        
        eli.trucco Eli Trucco added a comment -

        Hi Nick,
        I think this line, "Caused by: java.io.IOException: mark/reset not supported", is thrown because the input stream in the parseEmbedded method of TikaCLI.java is not wrapped in a TikaInputStream.
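
The failure mode described here can be reproduced with the JDK alone. The sketch below is illustrative only and is not the code from the Tika fix: the class `MarkResetDemo` and its methods `nonMarkable`, `peek`, and `ensureMarkable` are hypothetical names, and `BufferedInputStream` stands in for the `TikaInputStream` wrapping suggested in this thread. It assumes a detector-style routine that marks the stream, reads a few leading bytes, and then resets, which is the pattern the stack trace points at.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class MarkResetDemo {

    /** Hypothetical stand-in for an embedded-part stream that cannot rewind. */
    static InputStream nonMarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override public boolean markSupported() { return false; }
            @Override public void mark(int readlimit) { /* ignored, as in java.io.InputStream */ }
            @Override public void reset() throws IOException {
                // Same message as the default java.io.InputStream.reset()
                throw new IOException("mark/reset not supported");
            }
        };
    }

    /** Detector-style peek: read up to n leading bytes, then rewind for the parser. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break; // stream shorter than n bytes
                }
                total += r;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // throws if the stream does not support mark/reset
        }
    }

    /** Wrap only when needed, so the caller always receives a rewindable stream. */
    static InputStream ensureMarkable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0x89, 'P', 'N', 'G', 13, 10, 26, 10, 1, 2, 3};

        try {
            peek(nonMarkable(data), 8);
        } catch (IOException e) {
            System.out.println("unwrapped stream failed: " + e.getMessage());
        }

        InputStream wrapped = ensureMarkable(nonMarkable(data));
        byte[] header = peek(wrapped, 8);
        System.out.println("wrapped stream: peeked " + header.length + " bytes without error");
    }
}
```

The design point is that the wrapping happens once, at the place where the stream is handed to detection, so every detector downstream can rely on mark/reset.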

        gagravarr Nick Burch added a comment -

        Fixed in 952fb54 along with a simpler unit test inspired by your files, thanks!

        eli.trucco Eli Trucco added a comment -

        Great. So after resolving the exception, you were able to extract the attachments from both files?

        tallison@mitre.org Tim Allison added a comment -

        Any chance you could update 2.0, too? Thank you!

        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x #124 (See https://builds.apache.org/job/tika-2.x/124/)
        TIKA-2037 RFC822Parser should wrap the James InputStream of embedded (nick: rev 31374a39bae03bfc260f73662c133467637193f1)

        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
        • tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
        • CHANGES.txt
        • tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

        TIKA-2037 Merge fixes for 2.x (nick: rev f89887d2fbaa3949c398095b37322208a3fd4c7a)

        • tika-parsers/src/test/resources/test-documents/testEmailWithPNGAtt.eml
        • tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
        • tika-test-resources/src/test/resources/test-documents/testEmailWithPNGAtt.eml
        hudson Hudson added a comment -

        FAILURE: Integrated in tika-2.x-windows #28 (See https://builds.apache.org/job/tika-2.x-windows/28/)
        TIKA-2037 RFC822Parser should wrap the James InputStream of embedded (nick: rev 31374a39bae03bfc260f73662c133467637193f1)

        • CHANGES.txt
        • tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
        • tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/mbox/MboxParserTest.java
        • tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java

        TIKA-2037 Merge fixes for 2.x (nick: rev f89887d2fbaa3949c398095b37322208a3fd4c7a)

        • tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
        • tika-parsers/src/test/resources/test-documents/testEmailWithPNGAtt.eml
        • tika-test-resources/src/test/resources/test-documents/testEmailWithPNGAtt.eml

          People

          • Assignee: Unassigned
          • Reporter: eli.trucco Eli Trucco
          • Votes: 0
          • Watchers: 4
