Tika / TIKA-2122

Extract all email headers from Outlook .msg files into Metadata

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 2.0, 1.14
    • Component/s: parser
    • Labels: None

      Description

      Currently, most email headers are not added to the Metadata when extracting Outlook .msg files.

      http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java

      The headers - msg.getHeaders() - are already being looped through as a way to estimate the date.

      All headers should be added to Metadata, using the name of the header with a prefix such as "raw-header:".
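The request above could be sketched roughly as follows. This is only an illustration, not the code in OutlookExtractor: the class `RawHeaderCollector`, the method `collect`, and the use of a plain `Map` in place of Tika's Metadata are all hypothetical, and it assumes the headers arrive as an array of "Name: value" strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or POI.
final class RawHeaderCollector {
    // Assumed prefix; the description only suggests "a prefix such as raw-header:".
    static final String PREFIX = "raw-header:";

    // Stores every "Name: value" line under PREFIX + Name, keeping repeated headers.
    static Map<String, List<String>> collect(String[] headerLines) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (String line : headerLines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a "Name: value" line, e.g. a banner line
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            metadata.computeIfAbsent(PREFIX + name, k -> new ArrayList<>()).add(value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String[] headers = {
            "Subject: Microsoft Outlook Express 6",
            "MIME-Version: 1.0",
            "X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028"
        };
        collect(headers).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Values are kept in a list per header name because some headers, such as Received, normally occur more than once.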

        Activity

        ChrisKnott Chris Knott added a comment -

        Wow, thanks! Very fast turnaround.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1117 (See https://builds.apache.org/job/Tika-trunk/1117/)
        TIKA-2122 : add all headers from MSG and RFC822 files (tallison: rev 8e819c3caf3ff3b0492f600b4193d1b3ee74f51b)

        • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
        • (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #63 (See https://builds.apache.org/job/tika-2.x-windows/63/)
        TIKA-2122: Extract all headers from MSG/RFC822 (tallison: rev 30e03de89fd4b21cb91917c72aec12eede761be3)

        • (edit) tika-parser-modules/tika-parser-office-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
        • (edit) tika-parser-modules/tika-parser-web-module/pom.xml
        • (edit) tika-parser-modules/tika-parser-web-module/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java
        • (edit) tika-parser-modules/tika-parser-office-module/pom.xml
        • (edit) tika-parser-bundles/tika-parser-office-bundle/pom.xml
        • (edit) CHANGES.txt
        • (edit) tika-core/src/main/java/org/apache/tika/metadata/Message.java
        • (edit) tika-parser-modules/pom.xml
        • (edit) tika-core/src/test/java/org/apache/tika/TikaTest.java
        • (edit) tika-parser-modules/tika-parser-office-module/src/test/java/org/apache/tika/parser/microsoft/OutlookParserTest.java
        • (edit) tika-parser-modules/tika-parser-web-module/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
        tallison@mitre.org Tim Allison added a comment -

        Went with Message:Raw-Header: as the prefix. I ran this against the roughly 1.7k .msg files in our regression corpus. There are some small areas for improvement, but overall this looks good. I was able to reuse mime4j's DecoderUtil.decodeEncodedWords to handle encoded values.

        tallison@mitre.org Tim Allison added a comment -

        Raw headers extracted with counts from 1,721 .msg files in our regression corpus.

        tallison@mitre.org Tim Allison added a comment -

        Er, how about mail:raw-header:?

        tallison@mitre.org Tim Allison added a comment -

        We'll also have to start adding handling for encoding in headers:

        H: From: =?iso-8859-1?Q?L'=C9quipe_Microsoft_Outlook_Express?= <msoe@microsoft.com>
        H: To: "Nouvel utilisateur de Outlook Express"
        H: Subject: Microsoft Outlook Express 6
        H: Date: Thu, 5 Apr 2007 09:26:06 -0700
        H: MIME-Version: 1.0
        H: Content-Type: text/html;
        H: 	charset="iso-8859-1"
        H: Content-Transfer-Encoding: quoted-printable
        H: X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3028
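For readers unfamiliar with the encoded From value above, here is a minimal standard-library sketch of RFC 2047 encoded-word decoding. It is not mime4j's implementation; the class `EncodedWordSketch` and its methods are hypothetical, and it deliberately skips details such as dropping whitespace between adjacent encoded words and tolerating unknown charsets or malformed hex escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified decoder for illustration; real code should prefer a tested library.
final class EncodedWordSketch {
    // Matches =?charset?encoding?text?= where encoding is B (base64) or Q (quoted-printable-like).
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    // Replaces each encoded word in the header value with its decoded text.
    static String decode(String headerValue) {
        Matcher m = ENCODED_WORD.matcher(headerValue);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1)); // throws on unknown charsets
            byte[] bytes = m.group(2).equalsIgnoreCase("B")
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(bytes, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // "Q" encoding: '_' is a space, "=XX" is one hex-encoded byte, anything else is literal.
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }
}
```

Calling `EncodedWordSketch.decode` on the From value shown above returns `L'Équipe Microsoft Outlook Express <msoe@microsoft.com>`, since `=C9` is É in iso-8859-1 and each underscore stands for a space.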
        
        tallison@mitre.org Tim Allison added a comment -

        Yes, I think this is a really good idea with a prefix, partly because it will expose areas for further work in .msg, and as Nick Burch pointed out, we still need some volunteer energy on other properties within .msg.

        I suspect that folks interested in forensics would want both the raw headers and the other properties we might eventually pull out.

        For now, how about raw-email-header:?

        As an example of "areas for further work", it looks like POI is breaking headers on newlines or semicolons. Below are the headers from one of our current test files, each prepended with "H:":

        H: Microsoft Mail Internet Headers Version 2.0
        H: Received: from hq-ex3fe3.ptcnet.ptc.com ([132.253.201.67]) by HQ-MAIL3.ptcnet.ptc.com with Microsoft SMTPSVC(6.0.3790.3959);
        H: 	 Thu, 29 Jan 2009 14:17:10 -0500
        H: Received: from irp1.ptc.com ([12.11.148.83]) by hq-ex3fe3.ptcnet.ptc.com with Microsoft SMTPSVC(6.0.3790.3959);
        H: 	 Thu, 29 Jan 2009 14:17:10 -0500
        H: X-IronPort-Anti-Spam-Filtered: true
        H: X-IronPort-Anti-Spam-Result: AskBALePgUmM0wsCk2dsb2JhbACMeYZdPwEBAQEJCQoJEQWpcoEDjWwBAwEDhA0G
        H: X-IronPort-AV: E=Sophos;i="4.37,346,1231131600"; 
        H:    d="scan'208";a="51369639"
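One way to repair the split shown above is to unfold continuation lines before using the headers. The sketch below assumes the headers arrive as an array of lines and that a continuation line starts with a space or tab, as in the Received and X-IronPort-AV entries. `HeaderUnfolder` is a hypothetical name, not a POI or Tika class, and joining with a single space is a simplification of the unfolding rule in the mail RFCs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of POI or Tika.
final class HeaderUnfolder {
    // Joins each continuation line (leading space or tab) onto the header line before it.
    static List<String> unfold(String[] rawLines) {
        List<String> unfolded = new ArrayList<>();
        for (String line : rawLines) {
            boolean continuation = !line.isEmpty()
                    && (line.charAt(0) == ' ' || line.charAt(0) == '\t');
            if (continuation && !unfolded.isEmpty()) {
                int last = unfolded.size() - 1;
                // Simplification: collapse the fold into one space.
                unfolded.set(last, unfolded.get(last) + " " + line.trim());
            } else {
                unfolded.add(line);
            }
        }
        return unfolded;
    }
}
```

With this, the two-line Received entry becomes a single header ending in the date, which also helps any code that estimates the message date from the headers.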
        
        ChrisKnott Chris Knott added a comment -

        Sorry, I am not particularly familiar with Tika or POI; I just needed this feature for a current project. What do you mean by HMEF?

        My use case is extracting custom headers which start with "x-". I presume there's never going to be a way to do this properly, because the headers could be anything.

        How about extracting just the headers that start with "x-" and prepending them with "custom-email-header:" or something?

        On another note, what's the easiest workaround for this at the moment?
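The "x-" proposal above could look something like this sketch. It assumes the headers have already been split into a name-to-value map; `CustomHeaderFilter` and `customHeaders` are hypothetical names, and the prefix is the one suggested in the comment, not anything Tika defines.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
final class CustomHeaderFilter {
    // Prefix taken from the suggestion in the comment above.
    static final String PREFIX = "custom-email-header:";

    // Keeps only headers whose name starts with "x-" (case-insensitive), adding the prefix.
    static Map<String, String> customHeaders(Map<String, String> allHeaders) {
        Map<String, String> custom = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : allHeaders.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).startsWith("x-")) {
                custom.put(PREFIX + e.getKey(), e.getValue());
            }
        }
        return custom;
    }
}
```

The comparison is case-insensitive because header names such as X-MimeOLE and X-IronPort-AV are usually written with a capital X.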

        gagravarr Nick Burch added a comment -

        I'm not sure if we want to be dumping these raw into the Tika metadata - maybe we could do with a prefix though? (Would probably want syncing up with RFC822 and MBox parsers though for consistency)

        Also note that HMEF doesn't currently pull out all the possible properties from the MSG level (support for fixed-length properties is incomplete and in need of volunteer energy), so there may be more bits of metadata we could get from the MSG file "properly", which may negate some of the need for this. (Pending suitable POI work!)


          People

          • Assignee: Unassigned
          • Reporter: Chris Knott
          • Votes: 0
          • Watchers: 4

          Time Tracking

          • Original Estimate: 24h
          • Remaining Estimate: 24h
          • Time Spent: Not Specified
