  1. Tika
  2. TIKA-2037

Problems with email attachments

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.13
    • Fix Version/s: 1.14
    • Component/s: detector, parser
    • Labels: None
    • Environment: Eclipse, Java 8

      Description

      I ran into several problems while parsing and extracting attachments from .eml files from Thunderbird. Some files are misidentified (as text/html or application/xhtml+xml), and in many of them the attachments are not detected. I tried parsing 20 random .eml files with attachments (pdf, txt, html, etc.); at least 10 of them were either identified as HTML, or correctly identified as message/rfc822 but without their attachments being extracted. Running the same files through TikaCLI with the -z option gives the same result.

      What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract the attachments and store them in a separate location, exactly as shown in this example code: https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java
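
      To separate a detection failure from an extraction failure, it may help to establish independently how many attachments a given .eml file really contains. The following is a minimal sketch using Python's standard email package, not Tika. The function names list_attachments and build_sample and the sample message are illustrative assumptions, not part of the original report.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List, Tuple


def list_attachments(raw: bytes) -> List[Tuple[str, str]]:
    """Return (filename, content type) for each attachment-like part of a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)
    found = []
    for part in msg.walk():
        # Multipart containers only group other parts; skip them.
        if part.is_multipart():
            continue
        filename = part.get_filename()
        # Count parts marked "Content-Disposition: attachment" and any part with a filename.
        if part.is_attachment() or filename:
            found.append((filename or "(unnamed)", part.get_content_type()))
    return found


def build_sample() -> bytes:
    """Build a small synthetic message with two attachments (illustrative data only)."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["To"] = "recipient@example.org"
    msg["Subject"] = "Sample with attachments"
    msg.set_content("See attached.")
    msg.add_attachment(b"%PDF-1.4 dummy", maintype="application",
                       subtype="pdf", filename="notes.pdf")
    msg.add_attachment("plain text", filename="readme.txt")
    return msg.as_bytes()


if __name__ == "__main__":
    for name, content_type in list_attachments(build_sample()):
        print(name, content_type)
```

      For a real file, the bytes read from the .eml would be passed in place of build_sample(). If this lists attachments for a message where Tika extracts none, that suggests the problem lies in detection or parsing rather than in the message itself.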

        Attachments

        1. Exkursion.eml
          690 kB
          Eli Trucco
        2. CameraCalibration.eml
          77 kB
          Eli Trucco

            People

            • Assignee: Unassigned
            • Reporter: Eli Trucco (eli.trucco)
