[TIKA-2037] Problems with email attachments - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.13
Fix Version/s: 1.14
Component/s: detector, parser
Labels: None
Environment: Eclipse, Java 8

Description

I ran into a couple of problems while parsing and extracting attachments from .eml files from Thunderbird. Some files are wrongly identified (as text/html or application/xhtml+xml), and in many of them the attachments are not detected. I tried to parse 20 random .eml files with attachments (PDF, TXT, HTML, etc.), and at least 10 of them were either identified as HTML, or correctly identified as message/rfc822 but without their attachments being extracted. I tried the same files with the TikaCLI -z option and got the same result.

What I did: I extended the class ParsingEmbeddedDocumentExtractor to extract and store the attachments somewhere else (exactly as shown in this example code https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).
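
For readers who want to reproduce this, the setup described above could look roughly like the following. This is only a minimal sketch in the spirit of the linked ExtractEmbeddedFiles example, not the reporter's actual code: the class names AttachmentDump and SavingExtractor, the helper targetName, and the file-naming scheme are all hypothetical, and the metadata keys are written as string literals on the assumption that "resourceName" and "Content-Type" are the keys Tika uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class AttachmentDump {

    /** Hypothetical extractor: copies every embedded document into outputDir. */
    static class SavingExtractor extends ParsingEmbeddedDocumentExtractor {
        private final Path outputDir;
        private int count = 0;

        SavingExtractor(ParseContext context, Path outputDir) {
            super(context);
            this.outputDir = outputDir;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true; // accept every embedded document
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                                  Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // "resourceName" is assumed to carry the attachment's file name, if any.
            String name = targetName(metadata.get("resourceName"), count++);
            Files.copy(stream, outputDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Hypothetical naming scheme: index prefix plus the name without any directory part. */
    static String targetName(String resourceName, int index) {
        if (resourceName == null || resourceName.trim().isEmpty()) {
            return "attachment-" + index + ".bin";
        }
        int slash = Math.max(resourceName.lastIndexOf('/'), resourceName.lastIndexOf('\\'));
        return index + "-" + resourceName.substring(slash + 1);
    }

    public static void main(String[] args) throws Exception {
        Path eml = Paths.get(args[0]);                                 // input .eml file
        Path outputDir = Files.createDirectories(Paths.get(args[1])); // where attachments go

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        context.set(EmbeddedDocumentExtractor.class, new SavingExtractor(context, outputDir));

        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(eml)) {
            parser.parse(in, new DefaultHandler(), metadata, context);
        }
        // For an .eml file the expected type is message/rfc822.
        System.out.println("Detected type: " + metadata.get("Content-Type"));
    }
}
```

Run against one of the attached .eml files, the printed type shows whether the message was detected as message/rfc822, and the contents of the output directory show whether any attachments were handed to the extractor.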

Attachments

CameraCalibration.eml (77 kB), 20/Jul/16 15:44, Eli Trucco
Exkursion.eml (690 kB), 20/Jul/16 15:44, Eli Trucco

People

Assignee: Unassigned

Reporter: Eli Trucco

Votes: 0

Watchers: 4

Dates

Created: 20/Jul/16 15:42

Updated: 26/Jul/16 12:16

Resolved: 20/Jul/16 17:21