  1. Tika
  2. TIKA-2685

Emails attached to an undeliverable email report are not extracted

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      I have a number of email messages that are reports of undeliverable emails and contain the original email message as an attachment.

      The original emails are parts with "Content-Type: message/rfc822" but are not being recognized as such.
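      To illustrate the structure described above, here is a minimal sketch using Python's standard `email` package rather than Tika. The message is synthetic: its addresses, boundary, and exact layout are assumptions and are not taken from the attached example. It only shows that a part declared as "Content-Type: message/rfc822" can be walked as a nested email with its own headers.

```python
from email import message_from_string, policy

# Synthetic bounce message. The layout is an assumption for illustration,
# not a copy of the email attached to this issue.
RAW = """\
From: postmaster@example.com
To: alerts@example.com
Subject: Undeliverable: test
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="outer"

--outer
Content-Type: text/plain; charset=us-ascii

Delivery has failed.

--outer
Content-Type: message/rfc822

From: alerts@example.com
To: nobody@example.com
Subject: test
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Original body.

--outer--
"""


def list_parts(raw):
    """Return (depth, content type, subject) for every MIME part."""
    found = []

    def visit(part, depth):
        found.append((depth, part.get_content_type(), part["Subject"]))
        # A message/rfc822 part also reports is_multipart() == True:
        # its payload is a list holding the single nested message.
        if part.is_multipart():
            for child in part.get_payload():
                visit(child, depth + 1)

    visit(message_from_string(raw, policy=policy.default), 0)
    return found


parts = list_parts(RAW)
for depth, ctype, subject in parts:
    print("  " * depth + ctype, "|", subject)
```

      In this sketch the nested email appears one level below the message/rfc822 part, with its own Subject header, which is the kind of separation I would expect from Tika as well.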

      Attached is an example email:

      • Subject: Undeliverable: SRE Agent Out of Space Source:WindowsApp
        • Subject: SRE Agent Out of Space Source:WindowsApp

      I would like to see 2 separate emails parsed out (the top-level undeliverable report and, one level below it, the attached original email), but I get 1 email and 2 unnamed text attachments:

      $ java -jar tika-app-1.18.jar -m -J  /tmp/undeliverable.eml | python -m json.tool
      [
          {
              "Author": "postmaster@bank.com",
              "Content-Length": "17356",
              "Content-Type": "message/rfc822",
              "Creation-Date": "2017-11-04T16:00:11Z",
              "Message-From": "postmaster@bank.com",
              "Message-To": "UATAlerting@logscape.com",
              "Message:From-Email": "postmaster@bank.com",
              "Message:Raw-Header:Auto-Submitted": "auto-generated",
              "Message:Raw-Header:MIME-Version": "1.0",
              "Message:Raw-Header:Message-ID": "<936a3c2c-49e5-46a0-b58c-151c024b80fe@journal.report.generator>",
              "Message:Raw-Header:Return-Path": "<>",
              "Message:Raw-Header:Sender": "<MicrosoftExchange329e71ec88ae4615bbc36ab6ce41109e@bank.com>",
              "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
              "Message:Raw-Header:X-MS-Exchange-Message-Is-Ndr": "",
              "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "\t<1451b918-770a-4d83-b1f9-0c9c0668f1d6@BXTS124020.eu.banknet.com>",
              "Message:Raw-Header:X-MS-Journal-Report": "",
              "Multipart-Boundary": "_5a8d7320-7cd6-4c1b-8e30-9616634562b2_",
              "Multipart-Subtype": "mixed",
              "X-Parsed-By": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.mail.RFC822Parser"
              ],
              "X-TIKA:parse_time_millis": "326",
              "creator": "postmaster@bank.com",
              "dc:creator": "postmaster@bank.com",
              "dc:title": "Undeliverable: SRE Agent Out of Space Source:WindowsApp",
              "dcterms:created": "2017-11-04T16:00:11Z",
              "meta:author": "postmaster@bank.com",
              "meta:creation-date": "2017-11-04T16:00:11Z",
              "resourceName": "undeliverable.eml",
              "subject": "Undeliverable: SRE Agent Out of Space Source:WindowsApp"
          },
          {
              "Content-Encoding": "windows-1252",
              "Content-Type": "text/plain; charset=windows-1252",
              "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
              "Multipart-Subtype": "report",
              "X-Parsed-By": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.txt.TXTParser"
              ],
              "X-TIKA:embedded_resource_path": "/embedded-1",
              "X-TIKA:parse_time_millis": "4",
              "embeddedResourceType": "ATTACHMENT"
          },
          {
              "Content-Encoding": "US-ASCII",
              "Content-Type": "text/html; charset=US-ASCII",
              "Multipart-Boundary": "_dd8c2c7d-5333-4f9a-a282-d2056075e7aa_",
              "Multipart-Subtype": "report",
              "X-Parsed-By": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.html.HtmlParser"
              ],
              "X-TIKA:embedded_resource_path": "/embedded-2",
              "X-TIKA:parse_time_millis": "7",
              "embeddedResourceType": "ATTACHMENT"
          }
      ]
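      The check I am doing on this output can be sketched in Python as follows. This is a hedged helper, not part of Tika: the function name `embedded_emails` is hypothetical, the records are abbreviated copies of the entries above, and it assumes "Content-Type" is a plain string as it is in this output.

```python
import json


def embedded_emails(records):
    """Return embedded entries (not the container) typed message/rfc822."""
    return [
        r for r in records
        # Only embedded resources carry this key in the output above.
        if "X-TIKA:embedded_resource_path" in r
        # Assumes a string value, as in the output above.
        and r.get("Content-Type", "").startswith("message/rfc822")
    ]


# Abbreviated copy of the three records printed above.
observed = json.loads("""
[
  {"Content-Type": "message/rfc822",
   "resourceName": "undeliverable.eml"},
  {"Content-Type": "text/plain; charset=windows-1252",
   "X-TIKA:embedded_resource_path": "/embedded-1"},
  {"Content-Type": "text/html; charset=US-ASCII",
   "X-TIKA:embedded_resource_path": "/embedded-2"}
]
""")

# No embedded entry is recognized as an email in the observed output.
print(len(embedded_emails(observed)))  # 0
```

      If the attached original email were recognized, I would expect this list to contain one entry for it.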
      

              People

              • Assignee:
                tallison@apache.org Tim Allison
                Reporter:
                yurykats Yury Kats