Description
The mbox parser code is overly optimistic. It scans the entire message, body included, and treats anything that matches a header pattern as a header, wherever it occurs in a line!
It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?
Also, out of curiosity, why does the parser force Windows-1252 as the charset?
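To illustrate the first point, here is a minimal sketch of stricter header matching. It is not Tika's code: the class `MboxHeaderSketch`, the method `parseHeaders`, and the `HEADER` pattern are hypothetical names made up for this example. It assumes headers are only valid when anchored at the start of a line and only before the first blank line, with folded continuation lines appended to the previous header. Repeated header names simply overwrite each other here, which a real parser would need to handle.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MboxHeaderSketch {
    // Hypothetical pattern: a field name at the very start of the line,
    // immediately followed by a colon, then the value.
    private static final Pattern HEADER =
            Pattern.compile("^([A-Za-z][A-Za-z0-9-]*):[ \\t]*(.*)$");

    // Collects headers from the lines of one message, stopping at the
    // first blank line so that body text is never inspected.
    public static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null;
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            boolean folded = line.startsWith(" ") || line.startsWith("\t");
            if (folded && current != null) {
                // Continuation of the previous header's value.
                headers.put(current, headers.get(current) + " " + line.trim());
                continue;
            }
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = m.group(1);
                headers.put(current, m.group(2));
            }
            // Anything else (e.g. the mbox "From " separator) is ignored.
        }
        return headers;
    }

    public static void main(String[] args) {
        List<String> message = Arrays.asList(
                "From a@b Mon Jan 1 00:00:00 2018",
                "Subject: hello",
                " world",
                "To: x@y",
                "",
                "The body mentions Subject: fake, which must not be picked up.");
        System.out.println(parseHeaders(message));
    }
}
```

With this approach the "Subject: fake" text in the body is never considered, because matching stops at the blank line and the pattern must match from the start of the line.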
Attachments
Issue Links
- relates to: TIKA-2478 RFC822 includes redundant copies of the text (Resolved)