Details
Bug
Status: Resolved
Minor
Resolution: Fixed
1.16
None
None
Description
MBOX messages often get parsed into four documents:
a. The mbox file (outer container): "/"
b. The actual email: "/embedded-1"
c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not emit separate "attached" documents for the HTML body and the text body.
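To make the request concrete, here is a rough sketch of the "first non-null body wins" behavior described above. It is not Tika code: it uses Python's standard `email` module only for brevity, and the `first_body` helper and the sample message are hypothetical names made up for illustration.

```python
from email import message_from_string
from email.message import Message
from typing import Optional

# Hypothetical sample: one email whose body is multipart/alternative,
# carrying the same content once as text/plain and once as text/html.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: hello
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Hello Bob

--XYZ
Content-Type: text/html; charset="utf-8"

<p>Hello Bob</p>

--XYZ--
"""


def first_body(msg: Message) -> Optional[str]:
    """Return the first non-empty text body and ignore the remaining alternatives."""
    for part in msg.walk():
        # Containers (multipart/*) hold no body text themselves.
        if part.is_multipart():
            continue
        # Real attachments are not body candidates.
        if part.get_content_disposition() == "attachment":
            continue
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload:
            charset = part.get_content_charset() or "utf-8"
            text = payload.decode(charset, errors="replace").strip()
            if text:
                return text  # first non-empty body wins; the rest are skipped
    return None


message = message_from_string(RAW_EMAIL)
body = first_body(message)
print(body)
```

With this approach the email yields a single body rather than one child document per alternative, which is the behavior I would like the MBOX parser to have.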
The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.
Thanks!