Tika / TIKA-2680

Email attachments to an email are not extracted

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.18
    • Fix Version/s: 2.7.0
    • Component/s: None
    • Labels: None

    Description

      I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).

      The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.

      Attached is an example email, with the multiple levels of attachments:

      • Subject: Test email within email
        • Subject: Email within email test
          • Subject: Stand-up today
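
For reference, the expected nesting can be reproduced outside Tika. The following is a minimal sketch using Python's standard `email` package, not Tika's parser: it builds a message with the same three subjects, serializes it, parses it back, and walks every `message/rfc822` part. The helper names `build_nested` and `subjects_by_depth` and the body texts are made up for illustration; the attached nested.eml may be structured differently in detail.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def build_nested():
    """Build a three-level message; each level attaches the next one."""
    inner = EmailMessage()
    inner["Subject"] = "Stand-up today"
    inner.set_content("innermost body")  # placeholder body text

    middle = EmailMessage()
    middle["Subject"] = "Email within email test"
    middle.set_content("middle body")
    # Attaching a message object produces a message/rfc822 part.
    middle.add_attachment(inner)

    outer = EmailMessage()
    outer["Subject"] = "Test email within email"
    outer.set_content("outer body")
    outer.add_attachment(middle)
    return outer


def subjects_by_depth(msg, depth=0):
    """Return (depth, subject) for msg and each nested message/rfc822 part."""
    found = [(depth, str(msg["Subject"]))]
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            nested = part.get_content()  # the embedded message object
            found.extend(subjects_by_depth(nested, depth + 1))
    return found


# Round-trip through bytes so the walk runs on parsed input, like an .eml file.
raw = build_nested().as_bytes()
parsed = BytesParser(policy=policy.default).parsebytes(raw)
for depth, subject in subjects_by_depth(parsed):
    print("  " * depth + subject)
```

Three subjects at depths 0, 1 and 2 correspond to the three separate emails I expect Tika to report.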

       

      I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:

      $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
      [
          {
              "Author": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
              "Content-Length": "16649",
              "Content-Type": "message/rfc822",
              "Creation-Date": "2018-04-25T12:46:41Z",
              "Message-From": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
              "Message-To": [
                  "fm.SAN Management Team <fm.SANManagementTeam@bank.com>",
                  "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>"
              ],
              "Message:From-Email": "Henry.Van.der.Smith@bank.com",
              "Message:From-Name": "Smith Van der, H (Henry)",
              "Message:Raw-Header:Auto-Submitted": "auto-generated",
              "Message:Raw-Header:Content-Transfer-Encoding": "binary",
              "Message:Raw-Header:Keywords": "",
              "Message:Raw-Header:MIME-Version": "1.0",
              "Message:Raw-Header:Message-ID": "<ab2078ea-fd2f-4b28-bc8d-451916369b3c@journal.report.generator>",
              "Message:Raw-Header:Return-Path": "<>",
              "Message:Raw-Header:Sender": "<MicrosoftExchange329e71ec88ae4615bbc36ab6ce41109e@bank.com>",
              "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
              "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2070@BSTS124002.eu.banknet.com>",
              "Message:Raw-Header:X-MS-Journal-Report": "",
              "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
              "Multipart-Subtype": "mixed",
              "X-Parsed-By": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.mail.RFC822Parser"
              ],
              "X-TIKA:parse_time_millis": "325",
              "creator": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
              "dc:creator": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
              "dc:title": "Test email within email",
              "dcterms:created": "2018-04-25T12:46:41Z",
              "meta:author": "Smith Van der, H (Henry) <Henry.Van.der.Smith@bank.com>",
              "meta:creation-date": "2018-04-25T12:46:41Z",
              "resourceName": "nested.eml",
              "subject": "Test email within email"
          },
          {
              "Content-Encoding": "US-ASCII",
              "Content-Type": "text/plain; charset=US-ASCII",
              "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
              "Multipart-Subtype": "mixed",
              "X-Parsed-By": [
                  "org.apache.tika.parser.DefaultParser",
                  "org.apache.tika.parser.txt.TXTParser"
              ],
              "X-TIKA:embedded_resource_path": "/embedded-1",
              "X-TIKA:parse_time_millis": "5",
              "embeddedResourceType": "ATTACHMENT"
          }
      ]
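
The mismatch can be checked mechanically from the `-J` output. Below is a small sketch that counts how many metadata entries were identified as `message/rfc822` and compares that to the expected count. The function names and the trimmed-down `sample` string are illustrative only; in practice the input would be the full JSON printed by the command above.

```python
import json


def count_by_content_type(entries, prefix):
    """Count metadata entries whose Content-Type starts with the given prefix."""
    return sum(
        1
        for entry in entries
        if str(entry.get("Content-Type", "")).startswith(prefix)
    )


def check_nested_emails(tika_json, expected):
    """Return (found, ok) for the number of message/rfc822 entries in the output."""
    entries = json.loads(tika_json)
    found = count_by_content_type(entries, "message/rfc822")
    return found, found == expected


# Trimmed-down stand-in for the output above: one email, one text attachment.
sample = json.dumps([
    {"Content-Type": "message/rfc822", "subject": "Test email within email"},
    {"Content-Type": "text/plain; charset=US-ASCII",
     "embeddedResourceType": "ATTACHMENT"},
])

print(check_nested_emails(sample, expected=3))  # (1, False)
```

With the output shown in this report the check finds 1 email where 3 are expected.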
      
      

      Attachments

        1. main_email_in_outlook.jpg
          386 kB
          Tim Allison
        2. nested.eml
          16 kB
          Yury Kats
        3. pseudo-xml.xml
          22 kB
          Tim Allison
        4. TIKA-2680-1.eml-2.7.0-prerc1.json
          12 kB
          Tim Allison

      People

        Assignee: Tim Allison (tallison)
        Reporter: Yury Kats (yurykats)
        Votes: 1
        Watchers: 5
