[TIKA-2478] RFC822 includes redundant copies of the text - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.16
Fix Version/s: 1.17
Component/s: None
Labels: None

Description

MBOX messages often get parsed into four documents:
a. The mbox file, the outer container: "/"
b. The actual email: "/embedded-1"
c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"

Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not create separate "attached" documents for the HTML body and the text body.
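
To make the requested behavior concrete, here is a minimal, self-contained sketch of the "first non-null body wins" selection described above. It is not Tika code: the class BodySelection, the type BodyPart, and the method firstNonNullBody are hypothetical names invented for illustration, and the order in which renderings are tried is an assumption left to the caller.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class BodySelection {

    /** Hypothetical stand-in for one rendering (plain text, HTML, ...) of an email body. */
    public static final class BodyPart {
        public final String mimeType;
        public final String content; // null when the message lacks this rendering

        public BodyPart(String mimeType, String content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /** Returns the first rendering whose content is non-null; later renderings are skipped. */
    public static Optional<BodyPart> firstNonNullBody(List<BodyPart> renderings) {
        return renderings.stream()
                .filter(part -> part.content != null)
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed example input: the HTML rendering is missing, the others are present.
        List<BodyPart> renderings = Arrays.asList(
                new BodyPart("text/html", null),
                new BodyPart("text/plain", "Hello"),
                new BodyPart("text/rtf", "{\\rtf1 Hello}"));

        Optional<BodyPart> body = firstNonNullBody(renderings);
        // Only the selected rendering would be reported as the message body;
        // the remaining renderings would not become separate embedded documents.
        System.out.println(body.map(part -> part.mimeType).orElse("(no body)"));
    }
}
```

With this kind of selection, only entries a and b from the list above would remain, and the chosen body text would belong to "/embedded-1" itself.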

The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.

Thanks!

Attachments

UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, 23/Oct/17 19:30, 8 kB, Tim Allison
TIKA-2478.patch, 24/Oct/17 19:22, 80 kB, Tim Allison
mixed-with-pdf-inline, 23/Oct/17 21:04, 40 kB, Kenneth William Krugler
mixed-simple, 23/Oct/17 21:04, 2 kB, Kenneth William Krugler

Issue Links

is related to:
TIKA-2471 Tab-prefixed message body lines in Mbox interpreted as headers (Open)
TIKA-1788 message/rfc822 parser doesn't identify attachment filenames from Content-Disposition header (Resolved)

relates to:
TIKA-2547 RFC822 w multipart/mixed first text element should be treated as body, not attachment (Resolved)
TIKA-2614 RFC822 treats non-multipart as attachment (Resolved)

People

Assignee: Tim Allison
Reporter: Robert Letzler
Votes: 0
Watchers: 6

Dates

Created: 17/Oct/17 00:30
Updated: 03/Oct/18 23:02
Resolved: 02/Nov/17 12:37